当前位置：首页 > news >正文

哈尔滨做网站电话宁波有几个区

news 2025/11/15 8:37:10

哈尔滨做网站电话,宁波有几个区,jsp网站访问万维网,陕西煤炭建设公司网站1 相关预备知识模型#xff1a;包含了大量参数的一个网络#xff08;参数结构#xff09;#xff0c;体积10MB-10GB不等。模型格式#xff1a;相同的模型可以有不同的存储格式#xff08;可类比音视频文件#xff09;#xff0c;目前主流有torch、tf、onnx和trt…1 相关预备知识模型包含了大量参数的一个网络参数结构体积10MB-10GB不等。模型格式相同的模型可以有不同的存储格式可类比音视频文件目前主流有torch、tf、onnx和trt其中tf又包含了三种格式。模型推理输入和网络中的参数进行各种运算从而得到一个输出计算密集型任务且需要GPU加速。模型推理引擎模型推理工具可以让模型推理速度变快使用该工具往往需要特定的模型格式现在主流推理引擎有trt和ort。模型推理框架对模型推理的过程进行了封装使之新增、删除、替换模型更方便更高级的框架还会有负载均衡、模型监控、自动生成grpc、http接口等功能就是为部署而生的。接下来要介绍的triton就是目前比较优秀的一个模型推理框架。 2 入门triton 接下来手把手教你跑通triton让你明白triton到底是干啥的。 2.1 注册NGC平台 NGC可以理解是NV的一个官方软件仓库里面有好多编译好的软件、docker镜像等。我们要注册NGC并生成相应的api key这个api key用于在docker上登录ngc并下载里面的镜像。 NGC 官方网站的地址是: https://ngc.nvidia.com 注册和登录后你可以在用户界面(profile UI)中生成 API 关键字(key)。这个 key 允许你在 Docker 或其他适当的地方登录 NGC并从该仓库下载 Docker 镜像等资源。 2.2 登录命令行界面输入 docker login nvcr.io 然后输入用户名和你上一步生成的key用户名就是$oauthtoken不要忘记$符号不要使用自己的用户名。最后会出现Login Succeeded字样就代表登录成功了。 2.3 拉取镜像 docker pull nvcr.io/nvidia/tritonserver:22.04-py3 你也可以选择拉取其他版本的triton。镜像大概有几个G需耐心等待这个镜像不区分gpu和cpu是通用的。 2.4 构建模型目录执行命令mkdir -p /home/triton/model_repository/fc_model_pt/1。其中/home/triton/model_repository就是你的模型仓库所有的模型都在这个模型目录中。启动容器时会将其映射到容器中的/model文件夹上fc_model_pt可以理解为是某一个模型的存放目录比如一个用于情感分类的模型名字则没有要求最好见名知义1代表版本是1 模型仓库的目录结构如下 model-repository-path/# 模型仓库目录model-name/ # 模型名字[config.pbtxt] # 模型配置文件[output-labels-file ...] # 标签文件可以没有version/ # 该版本下的模型model-definition-fileversion/model-definition-file...model-name/[config.pbtxt][output-labels-file ...]version/model-definition-fileversion/model-definition-file......2.5 生成一个用于推理的torch模型创建一个torch模型并使用torchscript保存 import torch import torch.nn as nnclass SimpleModel(nn.Module):def __init__(self):super(SimpleModel, self).__init__()self.embedding nn.Embedding(100, 8)self.fc nn.Linear(8, 4)self.fc_list nn.Sequential(*[nn.Linear(8, 8) for _ in range(4)])def forward(self, input_ids):word_emb self.embedding(input_ids)output1 self.fc(word_emb)output2 self.fc_list(word_emb)return output1, output2if __name__ __main__:model SimpleModel()ipt torch.tensor([[1, 2, 3], [4, 5, 6]], dtypetorch.long)script_model torch.jit.trace(model, ipt, strictTrue)torch.jit.save(script_model, model.pt)// 这段代码定义了一个简单的 PyTorch 模型然后使用一个输入 Tensor 来在严格模式下 Trace 该模型并将 Trace 后的模型保存为 model.pt 文件。生成模型后拷贝到刚才建立的目录中,注意要放到版本号对应的目录且模型文件名必须是model.pt。 triton支持很多模型格式这里只是拿了torch举例。 2.6 编写配置文件为了让框架识别到刚放入的模型我们需要编写一个配置文件config.pbtxt具体内容如下 name: fc_model_pt # 模型名也是目录名 platform: pytorch_libtorch # 模型对应的平台本次使用的是torch不同格式的对应的平台可以在官方文档找到 max_batch_size : 64 # 一次送入模型的最大bsz防止oom input [{name: input__0 # 输入名字对于torch来说名字于代码的名字不需要对应但必须是name__index的形式注意是2个下划线写错就报错data_type: TYPE_INT64 # 类型torch.long对应的就是int64,不同语言的tensor类型与triton类型的对应关系可以在官方文档找到dims: [ -1 ] # -1 代表是可变维度,虽然输入是二维的但是默认第一个是bsz,所以只需要写后面的维度就行(无法理解的操作如果是[-1,-1]调用模型就报错)} ] output [{name: output__0 # 命名规范同输入data_type: TYPE_FP32dims: [ -1, -1, 4 ]},{name: output__1data_type: TYPE_FP32dims: [ -1, -1, 8 ]} ] 这个模型配置文件估计是整个triton最复杂的地方上线模型的大部分工作估计都在写配置文件我这边三言两语难以解释清楚只能给大家简单介绍下具体内容还请参考官方文档。注意配置文件不要放到版本号的目录里放到模型目录里也就是说config.pbtxt和版本号目录是平级的。关于输入shape默认第一个是bsz的官方说明 As discussed above, Triton assumes that batching occurs along the first dimension which is not listed in in the input or output tensor dims. However, for shape tensors, batching occurs at the first shape value. For the above example, an inference request must provide inputs with the following shapes. 2.7 创建容器并启动执行命令 docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 \-v /home/triton/model_repository/:/models \nvcr.io/nvidia/tritonserver:22.04-py3 \tritonserver \--model-repository/models 如果你的系统有一个可用的gpu那你可以加上如下参数 --gpus1来让推理框架使用GPU加速这个参数要放到run的后面。 2.8 测试接口如果你按照我的教程一步一步走下去那么肯定可以成功启动容器下面我们可以写段代码测试下接口是否是通的调用地址是http:\\localhost:8000/v2/models/{model_name}/versions/{version}/infer 测试代码如下 import requestsif __name__ __main__:request_data {inputs: [{name: input__0,shape: [2, 3],datatype: INT64,data: [[1, 2, 3], [4,5,6]]}],outputs: [{name: output__0}, {name: output__1}]}res requests.post(urlhttp://localhost:8000/v2/models/fc_model_pt/versions/1/infer,jsonrequest_data).json()print(res)// 这段代码使用 Python 的 requests 库向本地运行的服务器发送 POST 请求。服务器托管了一个模型即 fc_model_pt 。POST 请求包含输入数据和所期望的输出数据格式并打印服务器的响应结果。执行代码后会得到相应的输出 {model_name: fc_model_pt,model_version: 1,outputs: [{name: output__0,datatype: FP32,shape: [2, 3, 4],data: [1.152763843536377, 1.1349767446517944, -0.6294105648994446, 0.8846281170845032, 0.059508904814720154, -0.06066855788230896, -1.497096061706543, -1.192716121673584, 0.7339693307876587, 0.28189709782600403, 0.3425392210483551, 0.08894850313663483, 0.48277992010116577, 0.9581012725830078, 0.49371692538261414, -1.0144696235656738, -0.03292369842529297, 0.3465275764465332, -0.5444514751434326, -0.6578375697135925, 1.1234807968139648, 1.1258794069290161, -0.24797165393829346, 0.4530307352542877]}, {name: output__1,datatype: FP32,shape: [2, 3, 8],data: [-0.28994596004486084, 0.0626179575920105, -0.018645435571670532, -0.3376324474811554, -0.35003775358200073, 0.2884367108345032, -0.2418503761291504, -0.5449661016464233, -0.48939061164855957, -0.482677698135376, -0.27752232551574707, -0.26671940088272095, -0.2171783447265625, 0.025355860590934753, -0.3266356587409973, -0.06301657110452652, -0.1746724545955658, -0.23552510142326355, 0.10773542523384094, -0.4255935847759247, -0.47757795453071594, 0.4816707670688629, -0.16988427937030792, -0.35756853222846985, -0.06549499928951263, -0.04733048379421234, -0.035484105348587036, -0.4210450053215027, -0.07763291895389557, 0.2223128080368042, -0.23027443885803223, -0.4195460081100464, -0.21789231896400452, -0.19235755503177643, -0.16810789704322815, -0.34017443656921387, -0.05121977627277374, 0.08258339017629623, -0.2775516211986542, -0.27720844745635986, -0.25765007734298706, -0.014576494693756104, 0.0661710798740387, -0.38623639941215515, -0.45036202669143677, 0.3960753381252289, -0.20757021009922028, -0.511818528175354]}] }不知道是不是我的用法有问题就使用体验来看这个推理接口有些让我不适应明明在config.pbtxt里指定了datatype,但是输入的时候还需要指定不指定就会报错输入的shape也需要指定那不然也会报错datatype的值和config.pbtxt里不统一如果datatype设为TYPE_INT64则会报错必须为INT64输出里的data是个1维数组需要根据返回的shape自动reshape成对应的array 除了像我这样直接写代码调用还可以使用他们提供的官方库pip install tritonclient[http],地址如下https://github.com/triton-inference-server/client。 3 使用triton的高级特性上一小节的教程只是用到了triton的基本功能所以段位只能说是个黄金下面介绍下一些triton的高级特性。 3.1 模型并行模型并行可以指同时启动多个模型或单个模型的多个实例。实现起来并不复杂通过修改配置参数就可以。在默认情况下triton会在每个可用的gpu上都部署一个该模型的实例从而实现并行。接下来我会对多种情况进行测试以让你们清楚模型并行所带来的效果本人的配置是2块3060(租的)用于测试多模型的情况。压测命令使用ab -k -c 5 -n 500 -p ipt.json http://localhost:8000/v2/models/fc_model_pt/versions/1/infer 这条命令的意思是5个进程反复调用接口共500次。测试配置及对应的QPS如下共1个卡每个卡运行1个实例QPS为603共2个卡每个卡运行1个实例QPS为1115共2个卡每个卡运行2个实例QPS为1453共2个卡每个卡运行2个实例同时在CPU上放2个实例QPS为972 结论如下多卡性能有提升多个实例能进一步提升并发能力加入CPU会拖累速度主要是因为CPU速度太慢。下面是上述测试对应的配置项直接复制了放到config.pbtxt中就行 #共2个卡每个卡运行2个实例 instance_group [ {count: 2kind: KIND_GPUgpus: [ 0 ] }, {count: 2kind: KIND_GPUgpus: [ 1 ] } ] # 共2个卡每个卡运行2个实例同时在CPU上放2个实例 instance_group [ {count: 2kind: KIND_GPUgpus: [ 0 ] }, {count: 2kind: KIND_GPUgpus: [ 1 ] }, {count: 2kind: KIND_CPU } ]至于选择使用几张卡则通过创建容器时的--gpus来指定 3.2 动态batch 动态batch的意思是指, 对于一个请求,先不进行推理,等个几毫秒把这几毫秒的所有请求拼接成一个batch进行推理这样可以充分利用硬件提升并行能力当然缺点就是个别用户等待时间变长不适合低频次请求的场景。使用动态batch很简单只需要在config.pbtxt加上 dynamic_batching { }具体参数细节大家可以去看文档我的这种简单写法组成的bsz上限就是max_batch_size本人压测的结果是约有50%QPS提升反正就是有效果就对了。 PS这种优化方式对于压测来说简直就是作弊。。。 3.3 自定义backend 所谓自定义backend就是自己写推理过程正常情况下整个推理过程是通过模型直接解决的但是有一些推理过程还会包含一些业务逻辑比如整个推理过程需要2个模型其中要对第一个模型的输出结果做一些逻辑判断然后修改输出才能作为第二个模型的输入最简单的做法就是我们调用两次triton服务先调用第一个模型获取输出然后进行业务逻辑判断和修改然后再调用第二个模型。不过在triton中我们可以自定义一个backend把整个调用过程写在里面这样就简化调用过程同时也避免了一部分http传输时延。我举的例子其实是一个包含了业务逻辑的pipline这种做法NV叫做BLS(Business Logic Scripting)。要实现自定义的backend也很简单与上文讲的放torch模型流程基本一样首先建立模型文件夹然后在文件夹里新建config.pbtxt,然后新建版本文件夹然后放入model.py这个py文件里就写了推理过程。为了说明目录结构我把构建好的模型仓库目录树打印出来展示一下 model_repository/ # 模型仓库|-- custom_model # 我们的自定义backend模型目录| |-- 1 # 版本| | |-- model.py # 模型Py文件里面主要是推理的逻辑| -- config.pbtxt # 配置文件-- fc_model_pt # 上一小节介绍的模型|-- 1| -- model.pt-- config.pbtxt 如果上一小节你看明白了那么你就会发现自定义backend的模型目录设置和正常目录设置是一样的唯一不同的就是模型文件由网络权重变成了自己写的代码而已。下面说下config.pbtxt和model.py的文件内容大家可以直接复制粘贴 import json import numpy as np import triton_python_backend_utils as pb_utilsclass TritonPythonModel:Your Python model must use the same class name. Every Python modelthat is created must have TritonPythonModel as the class name.def initialize(self, args):initialize is called only once when the model is being loaded.Implementing initialize function is optional. This function allowsthe model to intialize any state associated with this model.Parameters----------args : dictBoth keys and values are strings. The dictionary keys and values are:* model_config: A JSON string containing the model configuration* model_instance_kind: A string containing model instance kind* model_instance_device_id: A string containing model instance device ID* model_repository: Model repository path* model_version: Model version* model_name: Model name# You must parse model_config. JSON string is not parsed hereself.model_config model_config json.loads(args[model_config])# Get output__0 configurationoutput0_config pb_utils.get_output_config_by_name(model_config, output__0)# Get output__1 configurationoutput1_config pb_utils.get_output_config_by_name(model_config, output__1)# Convert Triton types to numpy typesself.output0_dtype pb_utils.triton_string_to_numpy(output0_config[data_type])self.output1_dtype pb_utils.triton_string_to_numpy(output1_config[data_type])def execute(self, requests):requests : listA list of pb_utils.InferenceRequestReturns-------listA list of pb_utils.InferenceResponse. The length of this list mustbe the same as requestsoutput0_dtype self.output0_dtypeoutput1_dtype self.output1_dtyperesponses []# Every Python backend must iterate over everyone of the requests# and create a pb_utils.InferenceResponse for each of them.for request in requests:# 获取请求数据in_0 pb_utils.get_input_tensor_by_name(request, input__0)# 第一个输出结果自己随便造一个假的就假装是有逻辑了out_0 np.array([1, 2, 3, 4, 5, 6, 7, 8]) # 为演示方便先写死out_tensor_0 pb_utils.Tensor(output__0, out_0.astype(output0_dtype))# 第二个输出结果调用fc_model_pt获取结果inference_request pb_utils.InferenceRequest(model_namefc_model_pt,requested_output_names[output__0, output__1],inputs[in_0])inference_response inference_request.exec()out_tensor_1 pb_utils.get_output_tensor_by_name(inference_response, output__1)inference_response pb_utils.InferenceResponse(output_tensors[out_tensor_0, out_tensor_1])responses.append(inference_response)return responsesdef finalize(self):finalize is called only once when the model is being unloaded.Implementing finalize function is OPTIONAL. This function allowsthe model to perform any necessary clean ups before exit.print(Cleaning up...)// 这段代码是一个 Python 版本的 Triton 后端模型的基本框架。模型的 initialize 方法在加载模型时被调用一次execute 方法在收到推理请求时被调用并返回服务器的响应finalize 在模型被卸载时调用可选的实现清理工作。 config.pbtxt的文件内容 name: custom_model backend: python input [{name: input__0data_type: TYPE_INT64dims: [ -1, -1 ]} ] output [{name: output__0 data_type: TYPE_FP32dims: [ -1, -1, 4 ]},{name: output__1data_type: TYPE_FP32dims: [ -1, -1, 8 ]} ] 待上述工作都完成后就可以启动程序查看运行结果了大家可以直接复制我的代码进行测试 import requestsif __name__ __main__:request_data {inputs: [{name: input__0,shape: [1, 2],datatype: INT64,data: [[1, 2]]}],outputs: [{name: output__0}, {name: output__1}]}res requests.post(urlhttp://localhost:8000/v2/models/fc_model_pt/versions/1/infer,jsonrequest_data).json()print(res)res requests.post(urlhttp://localhost:8000/v2/models/custom_model/versions/1/infer,jsonrequest_data).json()print(res)// 这段代码使用 Python 的requests库向本地运行的服务器发送 POST 请求。服务器托管了两个模型一个是fc_model_pt另一个是custom_model。POST 请求包含输入数据和期望的输出数据格式。然后将服务器的响应结果打印出来。运行结果可以发现2次的输出中ouput__0是不一样而output__1是一样的这和咱们写的model.py的逻辑有关系我这里就不多解释了。 PS自定义backend避免了需要反复调用NLG模型进行生成而产生的传输时延

查看全文

http://www.zqtcl.cn/news/711685/