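As large language models have surged in popularity, the parameter counts of models deployed to production keep growing (from billions to hundreds of billions of parameters), and inference costs have risen sharply as a result. Many inference frameworks have therefore appeared, aiming to reduce inference latency and increase throughput.
This series covers inference with TensorRT-LLM. This article is the second in the series and walks through model quantization and inference based on Bloom. My blog posts on large models and the accompanying code are collected on GitHub in llm-action, for anyone who needs them.
Environment setup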
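Base configuration
CUDA: 12.2
Image: nvcr.io/nvidia/pytorch:23.10-py3
Because the server has no internet access, the image, installation packages, and source code to compile must be prepared in advance. The next step is installing TensorRT-LLM. Docker is the recommended way to build and run TensorRT-LLM, and the installation steps here follow the Docker image build steps in the TensorRT-LLM repository.
First, start the Docker container and enter it.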
docker run -dt --name tensorrt_llm_lgd \
    --restart=always \
    --gpus all \
    --network=host \
    --shm-size=4g \
    -m 64G \
    -v /home/guodong.li/workspace:/workspace \
    -w /workspace \
    nvcr.io/nvidia/pytorch:23.10-py3 \
    /bin/bash
docker exec -it tensorrt_llm_lgd bash
Install PyTorch, TensorRT, mpi4py, and the other dependencies.
# Uninstall TensorRT
pip uninstall -y tensorrt
pip uninstall -y torch-tensorrt
pip install mpi4py -i http://nexus3.xxx.com/repository/pypi/simple --trusted-host nexus3.xxx.com
pip install polygraphy-0.48.1-py2.py3-none-any.whl -i http://nexus3.xxx.com/repository/pypi/simple --trusted-host nexus3.xxx.com
# Reinstall PyTorch
pip install torch==2.1.0 -i http://nexus3.xxx.com/repository/pypi/simple --trusted-host nexus3.xxx.com
pip uninstall transformer-engine
# Reinstall TensorRT
tar -xf /tmp/TensorRT.tar -C /usr/local/
mv /usr/local/TensorRT-9.1.0.4 /usr/local/tensorrt
pip install /usr/local/tensorrt/python/tensorrt-*-cp310-*.whl -i http://nexus3.xxx.com/repository/pypi/simple --trusted-host nexus3.xxx.com
Configure the environment variable in /etc/profile.
export LD_LIBRARY_PATH=/usr/local/tensorrt/lib:${LD_LIBRARY_PATH}
Build TensorRT-LLM
python3 ./scripts/build_wheel.py --clean --trt_root /usr/local/tensorrt --cuda_architectures 80-real
Because the build runs offline, two configuration files need to be modified.
pip index: https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/scripts/build_wheel.py#L65
git remote repository address: https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/cpp/tests/CMakeLists.txt#L19
Install TensorRT-LLM
pip install ./build/tensorrt_llm*.whl -i http://nexus3.xxx.com/repository/pypi/simple --trusted-host nexus3.xxx.com
The environment setup is now complete.
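Before moving on, it can help to confirm that the packages installed above are visible to the Python interpreter inside the container. The following is a minimal sketch, not part of TensorRT-LLM: the helper name `check_packages` is hypothetical, and the package list is an assumption derived from the installation steps in this article.

```python
import importlib.util

# Assumption: these import names correspond to the packages installed above.
REQUIRED_PACKAGES = ["torch", "tensorrt", "mpi4py", "polygraphy", "tensorrt_llm"]


def check_packages(names):
    """Return a {package_name: is_importable} mapping without importing anything."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(REQUIRED_PACKAGES).items():
        print(f"{name:<14} {'OK' if found else 'MISSING'}")
```

A package reported as MISSING usually means the corresponding pip step did not complete or was run against a different Python environment.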
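Overview of development with the Bloom model
Next, we use the Bloom model as an example to work through TensorRT-LLM in practice.
The main files in the Bloom example are:
build.py: builds the TensorRT engine that runs the Bloom model.
run.py: runs model inference.
summarize.py: uses the model to summarize articles from the CNN Dailymail dataset.
hf_bloom_convert.py: converts a model in HF format.
TensorRT-LLM currently supports the following features for the Bloom model:
FP16
INT8 and INT4 weight-only quantization
INT8 KV cache quantization
SmoothQuant quantization
Tensor parallelism
An earlier article, 大模型量化概述 (Overview of Large Model Quantization), gave a brief introduction to quantization. I plan to cover the common quantization techniques in more detail when time allows.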
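To make the effect of these precisions concrete, here is a rough back-of-the-envelope estimate of weight storage at each precision. It is only a sketch: the parameter count of 3 billion is an assumption (reading the "3b" in bloomz-3b literally), and the helper `weight_memory_gib` is hypothetical. It ignores scaling factors, activations, the KV cache, and any tensors that stay in higher precision.

```python
# Bytes needed to store one weight at each precision listed above.
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params, precision):
    """Approximate weight storage in GiB for a given parameter count and precision."""
    return num_params * BYTES_PER_WEIGHT[precision] / 2**30


if __name__ == "__main__":
    num_params = 3_000_000_000  # assumption: roughly 3 billion parameters
    for precision in BYTES_PER_WEIGHT:
        print(f"{precision}: {weight_memory_gib(num_params, precision):.2f} GiB")
```

Real engine files will generally not match these numbers exactly, since an engine contains more than the quantized weight matrices.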
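Downloading the data and model
Download the Bloom model. This article uses bloomz-3b for quantization and inference.
# git-lfs must be installed first; it has usually been installed already.
# git lfs install
# Download the model
rm -rf /workspace/model/bloomz-3b
mkdir -p /workspace/model/bloomz-3b && git clone https://huggingface.co/bigscience/bloomz-3b /workspace/model/bloomz-3b
Download the datasets. This article uses the CNN Dailymail dataset and the LAMBADA dataset.
https://huggingface.co/datasets/ccdv/cnn_dailymail
https://huggingface.co/datasets/lambada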
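Building the TensorRT engine
TensorRT-LLM builds the TensorRT engine from the Bloom checkpoint on HF. If no checkpoint directory is specified, TensorRT-LLM builds the engine with dummy weights.
The build.py script is used below to build the TensorRT engine. build.py normally needs only a single GPU, but if you already have all the GPUs required for inference, you can add the --parallel_build argument to build in parallel and speed up engine construction.
Note: the parallel_build feature currently supports only a single node.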
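Common arguments of the hf_bloom_convert.py script:
out_dir: output path for the converted model.
in_file: path to the original model.
tensor_parallelism: tensor parallelism degree used for inference.
calibrate_kv_cache: generates scaling factors for the KV cache. Used when the KV cache is stored as INT8.
smoothquant: sets the α parameter when quantizing the model with SmoothQuant, and outputs INT8 weights. 0.5 is a good first value to try. The value must lie in [0, 1].
storage_type: data type used to store the model parameters.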
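The role of α in the smoothquant argument can be illustrated with the smoothing formula from the SmoothQuant method: each input channel j gets a factor s_j = max|X_j|^α / max|W_j|^(1-α), activations are divided by s_j and the matching weight rows are multiplied by it, so the matrix product is unchanged while activation outliers shrink. The sketch below is plain Python for illustration only; it is not the code in hf_bloom_convert.py, all function names are hypothetical, and it assumes no channel is entirely zero.

```python
def smoothing_factors(act_absmax, weight_absmax, alpha=0.5):
    """Per-input-channel factors s_j = act_absmax_j**alpha / weight_absmax_j**(1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [a**alpha / w ** (1.0 - alpha) for a, w in zip(act_absmax, weight_absmax)]


def smooth(x, w, alpha=0.5):
    """Return (x / s, s * w) so that the product x @ w is unchanged.

    x: list of rows, shape [tokens, in_features]
    w: list of rows, shape [in_features, out_features]
    Assumes every channel has at least one non-zero activation and weight.
    """
    in_features = len(w)
    act_absmax = [max(abs(row[j]) for row in x) for j in range(in_features)]
    weight_absmax = [max(abs(v) for v in w[j]) for j in range(in_features)]
    s = smoothing_factors(act_absmax, weight_absmax, alpha)
    x_smooth = [[row[j] / s[j] for j in range(in_features)] for row in x]
    w_smooth = [[v * s[j] for v in w[j]] for j in range(in_features)]
    return x_smooth, w_smooth


def matmul(a, b):
    """Naive matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


if __name__ == "__main__":
    x = [[0.5, 40.0], [-1.0, 60.0]]  # channel 1 carries activation outliers
    w = [[0.2, -0.4], [0.1, 0.3]]
    x_s, w_s = smooth(x, w, alpha=0.5)
    print("activation abs-max before:", [max(abs(r[j]) for r in x) for j in range(2)])
    print("activation abs-max after: ", [max(abs(r[j]) for r in x_s) for j in range(2)])
```

A larger α moves more of the quantization difficulty from the activations to the weights, which is why 0.5 is a balanced starting point.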
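Common arguments of the build.py script:
model_dir: directory of the original HF model.
bin_model_dir: directory of the converted binary model files, used for SmoothQuant or KV cache quantization.
dtype: data type of the model.
use_gemm_plugin: data type of the GEMM plugin.
use_gpt_attention_plugin: data type of the attention plugin.
output_dir: output directory for the engine.
use_layernorm_plugin: data type of the layernorm plugin.
use_weight_only: enables weight-only quantization, quantizing the weights of the various GEMMs to INT4/INT8.
weight_only_precision: weight precision for weight-only quantization. Takes effect only when use_weight_only is set.
use_smooth_quant: quantizes the activations and weights of the various GEMMs with SmoothQuant. For finer-grained quantization, use the --per_channel and --per_token options.
per_channel: by default, a single static scaling factor is used for the GEMM result. per_channel instead uses a different static scaling factor for each channel. The latter is usually more accurate but slightly slower.
per_token: by default, a single static scaling factor is used to scale activations into the int8 range. per_token chooses a custom scaling factor for each token at runtime. The latter is usually more accurate but slightly slower.
int8_kv_cache: by default, the KV cache uses dtype. int8_kv_cache selects int8 quantization for the KV cache.
use_parallel_embedding: embedding parallelism is disabled by default. Setting this argument enables it.
embedding_sharding_dim: tries to reduce the engine size by sharing the embedding lookup table between two layers. Note: this argument may not take effect when the required conditions are not met.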
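Several of the arguments above differ mainly in how many scaling factors are used. As a rough illustration, the sketch below compares symmetric INT8 quantization of a weight matrix with one scale for the whole tensor against one scale per output channel. It is generic quantization arithmetic in plain Python, not TensorRT-LLM's kernels; the function names are hypothetical, the matrix layout [out_features][in_features] is an assumption, and each row is assumed to contain a non-zero value.

```python
def quantize_int8(values, scale):
    """Symmetric quantization: round(v / scale), clamped to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map INT8 codes back to floating point."""
    return [q * scale for q in q_values]


def per_tensor_roundtrip(w):
    """Quantize and dequantize with one scale for the whole matrix."""
    scale = max(abs(v) for row in w for v in row) / 127
    return [dequantize(quantize_int8(row, scale), scale) for row in w]


def per_channel_roundtrip(w):
    """Quantize and dequantize with one scale per output channel (per row of w)."""
    out = []
    for row in w:
        scale = max(abs(v) for v in row) / 127
        out.append(dequantize(quantize_int8(row, scale), scale))
    return out


def max_abs_error(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


if __name__ == "__main__":
    # Row 0 holds small weights, row 1 holds large ones.
    w = [[0.01, -0.02, 0.015], [5.0, -3.0, 4.0]]
    print("per-tensor  max error:", max_abs_error(w, per_tensor_roundtrip(w)))
    print("per-channel max error:", max_abs_error(w, per_channel_roundtrip(w)))
```

With a single scale, the small-magnitude row is rounded almost entirely to zero, whereas per-channel scales preserve it. This is the accuracy benefit that the finer-grained options trade against speed.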
FP16
Build an engine from the HF weights on a single GPU with float16 precision. use_gemm_plugin is used to prevent accuracy issues.
python build.py --model_dir /workspace/model/bloomz-3b \
    --dtype float16 \
    --use_gemm_plugin float16 \
    --use_gpt_attention_plugin float16 \
    --output_dir /workspace/model/bloomz-3b_trt_engines/fp16/1-gpu/
Output engine files:
tree -h /workspace/model/bloomz-3b_trt_engines/fp16/1-gpu/
├── [6.8G] bloom_float16_tp1_rank0.engine
├── [1.2K] config.json
└── [327K] model.cache
INT8 weight-only quantization (W8A16)
Build an engine on a single GPU with INT8 weight-only quantization.
python build.py --model_dir /workspace/model/bloomz-3b \
    --dtype float16 \
    --use_gemm_plugin float16 \
    --use_gpt_attention_plugin float16 \
    --use_weight_only \
    --output_dir /workspace/model/bloomz-3b_trt_engines/int8_weight_only/1-gpu/
Output engine files:
tree -h /workspace/model/bloomz-3b_trt_engines/int8_weight_only/1-gpu/
├── [4.6G] bloom_float16_tp1_rank0.engine
├── [1.2K] config.json
└── [317K] model.cache
FP16 with 2-way tensor parallelism
Build the engines with 2-way tensor parallelism.
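As a reminder of what tensor parallelism does to a single layer, the sketch below splits the weight matrix of one linear layer column-wise across ranks, lets each rank compute its slice of the output, and concatenates the slices. This is a generic column-parallel example in plain Python with hypothetical function names; how TensorRT-LLM actually shards each Bloom layer is not covered in this article.

```python
def matmul(a, b):
    """Naive matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def split_columns(w, world_size):
    """Split w ([in_features, out_features]) into world_size column shards."""
    out_features = len(w[0])
    if out_features % world_size != 0:
        raise ValueError("out_features must be divisible by world_size")
    width = out_features // world_size
    return [[row[r * width:(r + 1) * width] for row in w] for r in range(world_size)]


def column_parallel_linear(x, w, world_size):
    """Compute x @ w by letting each 'rank' handle one column shard of w."""
    shards = split_columns(w, world_size)
    partials = [matmul(x, shard) for shard in shards]  # one partial output per rank
    # Concatenate along the feature dimension (an all-gather on real hardware).
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w = [[1, 0, 2, 1], [0, 1, 1, 3]]
    print(column_parallel_linear(x, w, world_size=2))
    print(matmul(x, w))  # same result without splitting
```

Each rank stores only its own shard of the weights, which is consistent with a tensor-parallel build producing one engine file per rank.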
python build.py --model_dir /workspace/model/bloomz-3b \
    --dtype float16 \
    --use_gemm_plugin float16 \
    --use_gpt_attention_plugin float16 \
    --output_dir /workspace/model/bloomz-3b_trt_engines/fp16/2-gpu/ \
    --world_size 2
Output engine files:
tree -h /workspace/model/bloomz-3b_trt_engines/fp16/2-gpu/
├── [4.0G] bloom_float16_tp2_rank0.engine
├── [4.0G] bloom_float16_tp2_rank1.engine
├── [1.2K] config.json
└── [327K] model.cache
INT8 weight-only quantization + INT8 KV cache quantization
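This section combines INT8 weight-only quantization with INT8 KV cache quantization.
For the INT8 KV cache, the hf_bloom_convert.py script provides the --calibrate-kv-cache option (short form -kv). Setting -kv calibrates the model and then exports the scaling factors needed for inference with an INT8 KV cache.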
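To give a feel for what such a scaling factor is, the sketch below derives one scale from calibration samples and uses it to store values as INT8 and read them back. It assumes plain symmetric abs-max quantization, which is an assumption for illustration rather than a description of the exact calibration in hf_bloom_convert.py; the function names are hypothetical and the calibration data is made up.

```python
def calibrate_scale(samples):
    """Scale that maps the largest absolute calibration value to 127."""
    return max(abs(v) for v in samples) / 127


def quantize_kv(values, scale):
    """Store floating-point K/V values as INT8 codes in [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize_kv(q_values, scale):
    """Recover approximate floating-point values when the cache is read."""
    return [q * scale for q in q_values]


if __name__ == "__main__":
    calibration = [0.3, -2.54, 1.2, 0.9]  # made-up values seen during calibration
    scale = calibrate_scale(calibration)
    cached = quantize_kv([1.0, -0.5, 2.0], scale)
    print("scale:", scale)
    print("int8 codes:", cached)
    print("restored:", dequantize_kv(cached, scale))
```

Because the scale is fixed after calibration, values outside the calibrated range are clamped, which is why representative calibration data matters.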
python3 hf_bloom_convert.py \
    -i /workspace/model/bloomz-3b \
    -o /workspace/model/bloom-c-model/int8_kv_cache/3b \
    --calibrate-kv-cache -t float16
Output:
tree -h /workspace/model/bloom-c-model/int8_kv_cache/3b
/workspace/model/bloom-c-model/int8_kv_cache/3b
└── [ 28K] 1-gpu
    ├── [2.1K] config.ini
    ├── [5.0K] model.final_layernorm.bias.bin
    ├── [5.0K] model.final_layernorm.weight.bin
    ├── [5.0K] model.layers.0.attention.dense.bias.bin
    ├── [ 12M] model.layers.0.attention.dense.weight.0.bin
    ├── [ 15K] model.layers.0.attention.query_key_value.bias.0.bin
    ├── [   4] model.layers.0.attention.query_key_value.scale_y_quant_orig.bin
    ├── [ 38M] model.layers.0.attention.query_key_value.weight.0.bin
    ├── [5.0K] model.layers.0.input_layernorm.bias.bin
    ├── [5.0K] model.layers.0.input_layernorm.weight.bin
    ├── [5.0K] model.layers.0.mlp.dense_4h_to_h.bias.bin
    ├── [ 50M] model.layers.0.mlp.dense_4h_to_h.weight.0.bin
    ├── [ 20K] model.layers.0.mlp.dense_h_to_4h.bias.0.bin
    ├── [ 50M] model.layers.0.mlp.dense_h_to_4h.weight.0.bin
    ├── [5.0K] model.layers.0.post_attention_layernorm.bias.bin
    ├── [5.0K] model.layers.0.post_attention_layernorm.weight.bin
    ├── [5.0K] model.layers.10.attention.dense.bias.bin
    ├── [ 12M] model.layers.10.attention.dense.weight.0.bin
    ├── [ 15K] model.layers.10.attention.query_key_value.bias.0.bin
    ├── [   4] model.layers.10.attention.query_key_value.scale_y_quant_orig.bin
    ├── [ 38M] model.layers.10.attention.query_key_value.weight.0.bin
    ├── [5.0K] model.layers.10.input_layernorm.bias.bin
    ├── [5.0K] model.layers.10.input_layernorm.weight.bin
    ├── [5.0K] model.layers.10.mlp.dense_4h_to_h.bias.bin
    ├── [ 50M] model.layers.10.mlp.dense_4h_to_h.weight.0.bin
    ├── [ 20K] model.layers.10.mlp.dense_h_to_4h.bias.0.bin
    ├── [ 50M] model.layers.10.mlp.dense_h_to_4h.weight.0.bin
    ├── [5.0K] model.layers.10.post_attention_layernorm.bias.bin
    ├── [5.0K] model.layers.10.post_attention_layernorm.weight.bin
    ...
    ├── [5.0K] model.word_embeddings_layernorm.bias.bin
    ├── [5.0K] model.word_embeddings_layernorm.weight.bin
    └── [1.2G] model.wpe.bin
Build the engine with INT8 weight-only quantization and INT8 KV cache quantization combined.
# Build model with both INT8 weight-only and INT8 KV cache enabled
python build.py --bin_model_dir=/workspace/model/bloom-c-model/int8_kv_cache/3b/1-gpu \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --use_layernorm_plugin \
    --int8_kv_cache \
    --output_dir /workspace/model/bloom-3b-c-model/int8_kv_cache/ \
    --use_weight_only
Result:
tree -h /workspace/model/bloom-3b-c-model/int8_kv_cache/
/workspace/model/bloom-3b-c-model/int8_kv_cache/
├── [4.6G] bloom_float16_tp1_rank0.engine
├── [1.2K] config.json
└── [ 78K] model.cache
0 directories, 3 files
SmoothQuant quantization (W8A8)
Unlike the FP16 build, where the HF weights are processed and loaded directly into TensorRT-LLM, SmoothQuant needs to load INT8 weights, and these weights must be preprocessed before the engine is built.
python3 hf_bloom_convert.py \
    -i /workspace/model/bloomz-3b \
    -o /workspace/model/bloom-3b-c-model/smooth/ \
    --smoothquant 0.5 \
    --tensor-parallelism 1 \
    --storage-type float16
Result:
tree -h /workspace/model/bloom-3b-c-model/smooth/
/workspace/model/bloom-3b-c-model/smooth/
└── [100K] 1-gpu
    ├── [2.1K] config.ini
    ├── [5.0K] model.final_layernorm.bias.bin
    ├── [5.0K] model.final_layernorm.weight.bin
    ├── [5.0K] model.layers.0.attention.dense.bias.bin
    ├── [   4] model.layers.0.attention.dense.scale_w_quant_orig.bin
    ├── [ 10K] model.layers.0.attention.dense.scale_w_quant_orig.col.bin
    ├── [   4] model.layers.0.attention.dense.scale_x_orig_quant.bin
    ├── [   4] model.layers.0.attention.dense.scale_y_accum_quant.bin
    ├── [ 10K] model.layers.0.attention.dense.scale_y_accum_quant.col.bin
    ├── [   4] model.layers.0.attention.dense.scale_y_quant_orig.bin
    ├── [ 10K] model.layers.0.attention.dense.smoother.0.bin
    ├── [ 12M] model.layers.0.attention.dense.weight.0.bin
    ├── [6.2M] model.layers.0.attention.dense.weight.int8.0.bin
    ├── [6.2M] model.layers.0.attention.dense.weight.int8.col.0.bin
    ├── [ 15K] model.layers.0.attention.query_key_value.bias.0.bin
    ├── [ 30K] model.layers.0.attention.query_key_value.scale_w_quant_orig.bin
    ├── [ 30K] model.layers.0.attention.query_key_value.scale_w_quant_orig.col.0.bin
    ├── [   4] model.layers.0.attention.query_key_value.scale_x_orig_quant.bin
    ├── [ 30K] model.layers.0.attention.query_key_value.scale_y_accum_quant.bin
    ├── [ 30K] model.layers.0.attention.query_key_value.scale_y_accum_quant.col.0.bin
    ├── [   4] model.layers.0.attention.query_key_value.scale_y_quant_orig.bin
    ├── [ 38M] model.layers.0.attention.query_key_value.weight.0.bin
    ├── [ 19M] model.layers.0.attention.query_key_value.weight.int8.0.bin
    ├── [ 19M] model.layers.0.attention.query_key_value.weight.int8.col.0.bin
    ├── [5.0K] model.layers.0.input_layernorm.bias.bin
    ├── [5.0K] model.layers.0.input_layernorm.weight.bin
    ├── [5.0K] model.layers.0.mlp.dense_4h_to_h.bias.bin
    ├── [   4] model.layers.0.mlp.dense_4h_to_h.scale_w_quant_orig.bin
    ├── [ 10K] model.layers.0.mlp.dense_4h_to_h.scale_w_quant_orig.col.bin
    ├── [   4] model.layers.0.mlp.dense_4h_to_h.scale_x_orig_quant.bin
    ├── [   4] model.layers.0.mlp.dense_4h_to_h.scale_y_accum_quant.bin
    ├── [ 10K] model.layers.0.mlp.dense_4h_to_h.scale_y_accum_quant.col.bin
    ├── [   4] model.layers.0.mlp.dense_4h_to_h.scale_y_quant_orig.bin
    ├── [ 40K] model.layers.0.mlp.dense_4h_to_h.smoother.0.bin
    ├── [ 50M] model.layers.0.mlp.dense_4h_to_h.weight.0.bin
    ├── [ 25M] model.layers.0.mlp.dense_4h_to_h.weight.int8.0.bin
    ├── [ 25M] model.layers.0.mlp.dense_4h_to_h.weight.int8.col.0.bin
    ├── [ 20K] model.layers.0.mlp.dense_h_to_4h.bias.0.bin
    ├── [   4] model.layers.0.mlp.dense_h_to_4h.scale_w_quant_orig.bin
    ├── [ 40K] model.layers.0.mlp.dense_h_to_4h.scale_w_quant_orig.col.0.bin
    ├── [   4] model.layers.0.mlp.dense_h_to_4h.scale_x_orig_quant.bin
    ├── [   4] model.layers.0.mlp.dense_h_to_4h.scale_y_accum_quant.bin
    ├── [ 40K] model.layers.0.mlp.dense_h_to_4h.scale_y_accum_quant.col.0.bin
    ├── [   4] model.layers.0.mlp.dense_h_to_4h.scale_y_quant_orig.bin
    ├── [ 50M] model.layers.0.mlp.dense_h_to_4h.weight.0.bin
    ├── [ 25M] model.layers.0.mlp.dense_h_to_4h.weight.int8.0.bin
    ├── [ 25M] model.layers.0.mlp.dense_h_to_4h.weight.int8.col.0.bin
    ├── [5.0K] model.layers.0.post_attention_layernorm.bias.bin
    ├── [5.0K] model.layers.0.post_attention_layernorm.weight.bin
    ...
    ├── [5.0K] model.word_embeddings_layernorm.bias.bin
    ├── [5.0K] model.word_embeddings_layernorm.weight.bin
    └── [1.2G] model.wpe.bin
INT8 quantization is enabled with the --use_smooth_quant option. By default, the engine is built with per-tensor (_per_tensor_) quantization.
# Build model for SmoothQuant in the _per_tensor_ mode.
python3 build.py --bin_model_dir=/workspace/model/bloom-3b-c-model/smooth/1-gpu \
    --use_smooth_quant \
    --output_dir /workspace/model/bloom-3b-c-model/smooth-quant \
    --use_gpt_attention_plugin float16
Result:
tree -h /workspace/model/bloom-3b-c-model/smooth-quant
/workspace/model/bloom-3b-c-model/smooth-quant
├── [3.4G] bloom_float16_tp1_rank0.engine
├── [1.2K] config.json
└── [516K] model.cache
0 directories, 3 files
Engines can also be built with per-token and per-channel (_per_token_ + _per_channel_) quantization.
# Build model for SmoothQuant in the _per_token_ + _per_channel_ mode
python3 build.py --bin_model_dir=/workspace/model/bloom-3b-c-model/smooth/1-gpu \
    --use_smooth_quant \
    --use_gpt_attention_plugin float16 \
    --output_dir /workspace/model/bloom-3b-c-model/smooth-quant-channel-token \
    --per_token \
    --per_channel
Result:
tree -h /home/guodong.li/workspace/model/bloom-3b-c-model/smooth-quant-channel-token
/home/guodong.li/workspace/model/bloom-3b-c-model/smooth-quant-channel-token
├── [4.6G] bloom_float16_tp1_rank0.engine
├── [1.2K] config.json
└── [516K] model.cache
0 directories, 3 files
Notes:
The GPT attention plugin (--use_gpt_attention_plugin) currently has to be enabled for SmoothQuant.
--bin_model_dir is used instead of --model_dir because SmoothQuant needs the INT8 weights and the various scales stored in the binary files.
Model inference
Next, run inference with the model and evaluate it with the ROUGE metric.
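For readers unfamiliar with ROUGE, the sketch below computes a simplified ROUGE-1 score, that is, unigram overlap between a reference summary and a generated one. It is a minimal illustration with a hypothetical function name: it uses lowercased whitespace tokens and no stemming, so it will not reproduce the exact numbers reported by summarize.py.

```python
from collections import Counter


def rouge1(reference, candidate):
    """Unigram-overlap precision, recall and F1 on lowercased whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    reference = "James Best died Monday at 88"
    candidate = "Actor James Best has died at age 88"
    print(rouge1(reference, candidate))
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L is based on the longest common subsequence instead of n-gram counts.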
Common arguments of the summarize.py script:
hf_model_location: location of the HF model and vocabulary.
test_hf: test the HF model.
test_trt_llm: test TensorRT-LLM.
data_type: data type; applies when test_hf is specified, converting the model parameters to half precision.
dataset_path: cache directory for the dataset.
engine_dir: engine directory.
FP16
python summarize.py --test_trt_llm \
    --hf_model_location /workspace/model/bloomz-3b \
    --data_type fp16 \
    --engine_dir /workspace/model/bloomz-3b_trt_engines/fp16/1-gpu/
INT8 weight-only quantization
python summarize.py --test_trt_llm \
    --hf_model_location /workspace/model/bloomz-3b \
    --data_type fp16 \
    --engine_dir /workspace/model/bloomz-3b_trt_engines/int8_weight_only/1-gpu/
Run log:
[11/14/2023-09:54:48] [TRT-LLM] [I] Load tokenizer takes: 0.6626021862030029 sec
[11/14/2023-09:54:54] [TRT] [I] Loaded engine size: 4708 MiB
[11/14/2023-09:54:55] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU 0, GPU 8, now: CPU 6142, GPU 46624 (MiB)
[11/14/2023-09:54:55] [TRT] [I] [MemUsageChange] Init cuDNN: CPU 2, GPU 10, now: CPU 6144, GPU 46634 (MiB)
[11/14/2023-09:54:55] [TRT] [W] TensorRT was linked against cuDNN 8.9.4 but loaded cuDNN 8.9.2
[11/14/2023-09:54:55] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU 0, GPU 4703, now: CPU 0, GPU 4703 (MiB)
[11/14/2023-09:54:55] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU 0, GPU 8, now: CPU 6149, GPU 48652 (MiB)
[11/14/2023-09:54:55] [TRT] [I] [MemUsageChange] Init cuDNN: CPU 0, GPU 8, now: CPU 6149, GPU 48660 (MiB)
[11/14/2023-09:54:55] [TRT] [W] TensorRT was linked against cuDNN 8.9.4 but loaded cuDNN 8.9.2
[11/14/2023-09:54:56] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU 0, GPU 0, now: CPU 0, GPU 4703 (MiB)
[11/14/2023-09:54:56] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU 0, GPU 8, now: CPU 6195, GPU 48680 (MiB)
[11/14/2023-09:54:56] [TRT] [I] [MemUsageChange] Init cuDNN: CPU 1, GPU 10, now: CPU 6196, GPU 48690 (MiB)
[11/14/2023-09:54:56] [TRT] [W] TensorRT was linked against cuDNN 8.9.4 but loaded cuDNN 8.9.2
[11/14/2023-09:54:57] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU 0, GPU 0, now: CPU 0, GPU 4703 (MiB)
[11/14/2023-09:54:58] [TRT-LLM] [I] Load engine takes: 9.880424976348877 sec
/workspace/TensorRT-LLM/examples/bloom/summarize.py:165: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  [torch.tensor(line_encoded[i], dtype=torch.int32), pad],
[11/14/2023-09:54:59] [TRT-LLM] [I] ---------------------------------------------------------
[11/14/2023-09:54:59] [TRT-LLM] [I] TensorRT-LLM Generated :
[11/14/2023-09:54:59] [TRT-LLM] [I] Article : [(CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TVs The Dukes of Hazzard, died Monday after a brief illness. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia, said Steve Latshaw, a longtime friend and Hollywood colleague. Although hed been a busy actor for decades in theater and in Hollywood, Best didnt become famous until 1979, when The Dukes of Hazzards cornpone charms began beaming into millions of American homes almost every Friday night. For seven seasons, Bests Rosco P. Coltrane chased the moonshine-running Duke boys back and forth across the back roads of fictitious Hazzard County, Georgia, although his hot pursuit usually ended with him crashing his patrol car. Although Rosco was slow-witted and corrupt, Best gave him a childlike enthusiasm that got laughs and made him endearing. His character became known for his distinctive kew-kew-kew chuckle and for goofy catchphrases such as cuff em and stuff em! upon making an arrest. Among the most popular shows on TV in the early 80s, The Dukes of Hazzard ran until 1985 and spawned TV movies, an animated series and video games. Several of Bests Hazzard co-stars paid tribute to the late actor on social media. I laughed and learned more from Jimmie in one hour than from anyone else in a whole year, co-star John Schneider, who played Bo Duke, said on Twitter. Give Uncle Jesse my love when you see him dear friend. Jimmy Best was the most constantly creative person I have ever known, said Ben Jones, who played mechanic Cooter on the show, in a Facebook post. Every minute of his long life was spent acting, writing, producing, painting, teaching, fishing, or involved in another of his lifes many passions. Born Jewel Guy on July 26, 1926, in Powderly, Kentucky, Best was orphaned at 3 and adopted by Armen and Essa Best, who renamed him James and raised him in rural Indiana. 
Best served in the Army during World War II before launching his acting career. In the 1950s and 1960s, he accumulated scores of credits, playing a range of colorful supporting characters in such TV shows as The Twilight Zone, Bonanza, The Andy Griffith Show and Gunsmoke. He later appeared in a handful of Burt Reynolds movies, including Hooper and The End. But Best will always be best known for his Hazzard role, which lives on in reruns. Jimmie was my teacher, mentor, close friend and collaborator for 26 years, Latshaw said. I directed two of his feature films, including the recent Return of the Killer Shrews, a sequel he co-wrote and was quite proud of as he had made the first one more than 50 years earlier. People weve lost in 2015 . CNNs Stella Chan contributed to this story.]
[11/14/2023-09:54:59] [TRT-LLM] [I]Highlights : [James Best, who played the sheriff on The Dukes of Hazzard, died Monday at 88 .\nHazzard ran from 1979 to 1985 and was among the most popular shows on TV .]
[11/14/2023-09:54:59] [TRT-LLM] [I]Summary : [[ Actor James Best, best known for his role as bumbling sheriff Rosco P. Coltrane on TVs The Dukes of Hazzard, has died at age 88.]]
[11/14/2023-09:54:59] [TRT-LLM] [I] ---------------------------------------------------------
/workspace/TensorRT-LLM/examples/bloom/summarize.py:165: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  [torch.tensor(line_encoded[i], dtype=torch.int32), pad],
[11/14/2023-09:55:10] [TRT-LLM] [I] TensorRT-LLM (total latency: 10.436434745788574 sec)
[11/14/2023-09:55:10] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[11/14/2023-09:55:11] [TRT-LLM] [I] rouge1 : 30.60846842935061
[11/14/2023-09:55:11] [TRT-LLM] [I] rouge2 : 11.315593160478784
[11/14/2023-09:55:11] [TRT-LLM] [I] rougeL : 24.043680494718327
[11/14/2023-09:55:11] [TRT-LLM] [I] rougeLsum : 26.250663629946125
FP16 with 2-way tensor parallelism
mpirun -n 2 --allow-run-as-root \
    python summarize.py --test_trt_llm \
    --hf_model_location /workspace/model/bloomz-3b \
    --data_type fp16 \
    --engine_dir /workspace/model/bloomz-3b_trt_engines/fp16/2-gpu/
Run log:
[11/14/2023-09:58:13] [TRT-LLM] [MPI_Rank 1] [I] Load tokenizer takes: 0.4274311065673828 sec
[11/14/2023-09:58:13] [TRT-LLM] [MPI_Rank 0] [I] Load tokenizer takes: 0.45519232749938965 sec
[11/14/2023-09:58:17] [TRT] [I] Loaded engine size: 4094 MiB
[11/14/2023-09:58:18] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU 0, GPU 8, now: CPU 5533, GPU 41994 (MiB)
[11/14/2023-09:58:18] [TRT] [I] [MemUsageChange] Init cuDNN: CPU 1, GPU 10, now: CPU 5534, GPU 42004 (MiB)
[11/14/2023-09:58:18] [TRT] [W] TensorRT was linked against cuDNN 8.9.4 but loaded cuDNN 8.9.2
[11/14/2023-09:58:19] [TRT] [I] Loaded engine size: 4094 MiB
[11/14/2023-09:58:20] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU 0, GPU 8, now: CPU 5529, GPU 46010 (MiB)
[11/14/2023-09:58:20] [TRT] [I] [MemUsageChange] Init cuDNN: CPU 1, GPU 10, now: CPU 5530, GPU 46020 (MiB)
[11/14/2023-09:58:20] [TRT] [W] TensorRT was linked against cuDNN 8.9.4 but loaded cuDNN 8.9.2
[11/14/2023-09:58:23] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU 0, GPU 4088, now: CPU 0, GPU 4088 (MiB)
[11/14/2023-09:58:23] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU 0, GPU 4088, now: CPU 0, GPU 4088 (MiB)
[11/14/2023-09:58:23] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU 0, GPU 8, now: CPU 5749, GPU 43220 (MiB)
[11/14/2023-09:58:23] [TRT] [I] [MemUsageChange] Init cuDNN: CPU 0, GPU 8, now: CPU 5749, GPU 43228 (MiB)
[11/14/2023-09:58:23] [TRT] [W] TensorRT was linked against cuDNN 8.9.4 but loaded cuDNN 8.9.2
[11/14/2023-09:58:23] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU 0, GPU 8, now: CPU 5749, GPU 47236 (MiB)
[11/14/2023-09:58:23] [TRT] [I] [MemUsageChange] Init cuDNN: CPU 0, GPU 8, now: CPU 5749, GPU 47244 (MiB)
[11/14/2023-09:58:23] [TRT] [W] TensorRT was linked against cuDNN 8.9.4 but loaded cuDNN 8.9.2
[11/14/2023-09:58:23] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU 0, GPU 0, now: CPU 0, GPU 4088 (MiB)
[11/14/2023-09:58:23] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU 1, GPU 8, now: CPU 5796, GPU 47262 (MiB)
[11/14/2023-09:58:23] [TRT] [I] [MemUsageChange] Init cuDNN: CPU 0, GPU 10, now: CPU 5796, GPU 47272 (MiB)
[11/14/2023-09:58:23] [TRT] [W] TensorRT was linked against cuDNN 8.9.4 but loaded cuDNN 8.9.2
[11/14/2023-09:58:24] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU 0, GPU 0, now: CPU 0, GPU 4088 (MiB)
[11/14/2023-09:58:24] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU 0, GPU 8, now: CPU 5796, GPU 43246 (MiB)
[11/14/2023-09:58:24] [TRT] [I] [MemUsageChange] Init cuDNN: CPU 0, GPU 10, now: CPU 5796, GPU 43256 (MiB)
[11/14/2023-09:58:24] [TRT] [W] TensorRT was linked against cuDNN 8.9.4 but loaded cuDNN 8.9.2
[11/14/2023-09:58:24] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU 0, GPU 0, now: CPU 0, GPU 4088 (MiB)
[11/14/2023-09:58:24] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU 0, GPU 0, now: CPU 0, GPU 4088 (MiB)
[11/14/2023-09:58:25] [TRT-LLM] [MPI_Rank 0] [I] Load engine takes: 11.81023645401001 sec
[11/14/2023-09:58:25] [TRT-LLM] [MPI_Rank 1] [I] Load engine takes: 11.762826204299927 sec
/workspace/TensorRT-LLM/examples/bloom/summarize.py:165: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  [torch.tensor(line_encoded[i], dtype=torch.int32), pad],
/workspace/TensorRT-LLM/examples/bloom/summarize.py:165: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  [torch.tensor(line_encoded[i], dtype=torch.int32), pad],
[11/14/2023-09:58:27] [TRT-LLM] [MPI_Rank 0] [I] ---------------------------------------------------------
[11/14/2023-09:58:27] [TRT-LLM] [MPI_Rank 0] [I] TensorRT-LLM Generated :
[11/14/2023-09:58:27] [TRT-LLM] [MPI_Rank 0] [I] Article : [(CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TVs The Dukes of Hazzard, died Monday after a brief illness. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia, said Steve Latshaw, a longtime friend and Hollywood colleague. Although hed been a busy actor for decades in theater and in Hollywood, Best didnt become famous until 1979, when The Dukes of Hazzards cornpone charms began beaming into millions of American homes almost every Friday night. For seven seasons, Bests Rosco P. Coltrane chased the moonshine-running Duke boys back and forth across the back roads of fictitious Hazzard County, Georgia, although his hot pursuit usually ended with him crashing his patrol car. Although Rosco was slow-witted and corrupt, Best gave him a childlike enthusiasm that got laughs and made him endearing. His character became known for his distinctive kew-kew-kew chuckle and for goofy catchphrases such as cuff em and stuff em! upon making an arrest. Among the most popular shows on TV in the early 80s, The Dukes of Hazzard ran until 1985 and spawned TV movies, an animated series and video games. Several of Bests Hazzard co-stars paid tribute to the late actor on social media. I laughed and learned more from Jimmie in one hour than from anyone else in a whole year, co-star John Schneider, who played Bo Duke, said on Twitter. Give Uncle Jesse my love when you see him dear friend. Jimmy Best was the most constantly creative person I have ever known, said Ben Jones, who played mechanic Cooter on the show, in a Facebook post. Every minute of his long life was spent acting, writing, producing, painting, teaching, fishing, or involved in another of his lifes many passions. Born Jewel Guy on July 26, 1926, in Powderly, Kentucky, Best was orphaned at 3 and adopted by Armen and Essa Best, who renamed him James and raised him in rural Indiana. 
Best served in the Army during World War II before launching his acting career. In the 1950s and 1960s, he accumulated scores of credits, playing a range of colorful supporting characters in such TV shows as The Twilight Zone, Bonanza, The Andy Griffith Show and Gunsmoke. He later appeared in a handful of Burt Reynolds movies, including Hooper and The End. But Best will always be best known for his Hazzard role, which lives on in reruns. Jimmie was my teacher, mentor, close friend and collaborator for 26 years, Latshaw said. I directed two of his feature films, including the recent Return of the Killer Shrews, a sequel he co-wrote and was quite proud of as he had made the first one more than 50 years earlier. People weve lost in 2015 . CNNs Stella Chan contributed to this story.]
[11/14/2023-09:58:27] [TRT-LLM] [MPI_Rank 0] [I]Highlights : [James Best, who played the sheriff on The Dukes of Hazzard, died Monday at 88 .\nHazzard ran from 1979 to 1985 and was among the most popular shows on TV .]
[11/14/2023-09:58:27] [TRT-LLM] [MPI_Rank 0] [I]Summary : [[ Actor James Best, best known for his role as bumbling sheriff Rosco P. Coltrane on TVs The Dukes of Hazzard, has died at age 88.]]
[11/14/2023-09:58:27] [TRT-LLM] [MPI_Rank 0] [I] ---------------------------------------------------------
/workspace/TensorRT-LLM/examples/bloom/summarize.py:165: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  [torch.tensor(line_encoded[i], dtype=torch.int32), pad],
/workspace/TensorRT-LLM/examples/bloom/summarize.py:165: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  [torch.tensor(line_encoded[i], dtype=torch.int32), pad],
[11/14/2023-09:58:42] [TRT-LLM] [MPI_Rank 0] [I] TensorRT-LLM (total latency: 14.928563356399536 sec)
[11/14/2023-09:58:42] [TRT-LLM] [MPI_Rank 0] [I] TensorRT-LLM beam 0 result
[11/14/2023-09:58:43] [TRT-LLM] [MPI_Rank 0] [I] rouge1 : 27.12991734291884
[11/14/2023-09:58:43] [TRT-LLM] [MPI_Rank 0] [I] rouge2 : 8.273487794146279
[11/14/2023-09:58:43] [TRT-LLM] [MPI_Rank 0] [I] rougeL : 21.08356714989421
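[11/14/2023-09:58:43] [TRT-LLM] [MPI_Rank 0] [I] rougeLsum : 23.51165220383353
SmoothQuant quantization
Per-tensor quantization
python summarize.py --test_trt_llm \
    --hf_model_location /workspace/model/bloomz-3b \
    --data_type fp16 \
    --engine_dir /workspace/model/bloom-3b-c-model/smooth-quant
Per-channel quantization (the per_token + per_channel engine)
python summarize.py --test_trt_llm \
    --hf_model_location /workspace/model/bloomz-3b \
    --data_type fp16 \
    --engine_dir /workspace/model/bloom-3b-c-model/smooth-quant-channel-token
Summary
This article briefly covered setting up the TensorRT-LLM environment, then quantized a Bloom model and ran inference with it. Writing takes effort, so if you found it helpful, likes, bookmarks, and follows are appreciated.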
References
https://github.com/NVIDIA/TensorRT-LLM/tree/v0.5.0
https://github.com/NVIDIA/TensorRT-LLM/blob/v0.5.0/docker/Dockerfile.multi
https://github.com/NVIDIA/TensorRT-LLM/blob/v0.5.0/docs/source/installation.md