## Overview
Large language models (LLMs) are one of the most important recent advances in artificial intelligence, and their arrival marks a major breakthrough in natural language processing. Trained with deep learning on massive datasets, these models can understand and generate human language, providing powerful text-processing capabilities for a wide range of applications. Technically, LLMs are built on deep learning and NLP: they learn language representations through self-supervised pre-training on large-scale text corpora, and the pre-trained model can then be adapted to specific tasks and scenarios through fine-tuning and similar methods.

Going forward, LLMs are expected to play a role in more and more areas, including natural language understanding, text generation, dialogue systems, and machine translation. They can power automatic summarization, document generation, intelligent customer service, question answering, and many other applications, giving users smarter and more personalized services.

This document contains my study notes on large language models and on deploying FastGPT. A private knowledge base is built with FastGPT, backed either by deploying the **ChatGLM3** large language model directly or by serving models through the **Ollama** model management tool. **one-api** and **fastgpt** must be deployed in both approaches; for the model itself, Ollama is the recommended route, because switching models with it is fast, convenient, and easy to manage.
## Hardware requirements

The configurations below are for reference only:

- **chatglm3-6b + m3e**: RTX 3060 12 GB or better
- **qwen:4b + m3e**: RTX 3060 12 GB or better
- **qwen:2b + m3e**: GTX 1660 6 GB or better

In short: the larger the model, the more GPU power it needs. Very small models can run even on a low-end CPU, but inference accuracy is poor, and they are even less suited to running alongside the m3e embedding model, where inference becomes very slow.
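If you are unsure which tier your machine falls into, you can check the GPU model and VRAM from Python. A small sketch, assuming PyTorch with CUDA support is already installed (see the Conda and PyTorch steps below):

```python
import torch

# Print the GPU name and total VRAM to decide which model size is realistic.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No CUDA device detected: only small quantized models will run, slowly, on CPU.")
```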
## Resources used in this document

- **conda**: https://www.anaconda.com/ . Conda is an open-source package and environment manager that runs on Windows, macOS, and Linux. It can quickly install, run, and update packages and their dependencies, and lets you easily create, save, load, and switch environments on your local machine. It was created for Python programs, but it can package and distribute software for any language. In short, Conda is excellent and will save you a lot of headaches.
- **one-api**: https://github.com/songquanpeng/one-api . An "all-in-one" OpenAI-style gateway that unifies access to many model APIs behind one interface; one-command deployment, works out of the box.
- **chatglm3**: https://github.com/THUDM/ChatGLM3 . ChatGLM3 is a conversational pre-trained model released jointly by Zhipu AI and Tsinghua University's KEG lab.
- **m3e**: https://modelscope.cn/models/Jerry0/m3e-base/summary . M3E stands for Moka Massive Mixed Embedding. Moka: trained, open-sourced, and evaluated by MokaAI (training uses uniem, benchmarking uses MTEB-zh). Massive: trained on roughly 22 million Chinese sentence pairs. Mixed: supports Chinese-English bilingual homogeneous text similarity and heterogeneous text retrieval, with code retrieval planned. Embedding: a text embedding model that maps natural language into dense vectors.
- **ollama**: https://ollama.com/ . A management tool for large language models.
- **fastgpt**: https://github.com/labring/FastGPT . FastGPT is an LLM-based knowledge-base Q&A system with out-of-the-box data processing and model invocation, plus Flow-based visual workflow orchestration for complex Q&A scenarios.
- **ModelScope (魔搭社区)**: https://modelscope.cn/home . ModelScope aims to be a next-generation open-source model-as-a-service platform that gives AI developers flexible, easy-to-use, low-cost, one-stop model services.

## Using Conda
Official site: https://www.anaconda.com/
After installation, update Conda itself:

```bash
conda update -n base -c defaults conda
```
Update all installed packages:

```bash
conda update --all
```
Create a virtual environment:

```bash
conda create --name windows_chatglm3-6b python=3.11 -y
```
Activate and enter the virtual environment:

```bash
conda activate windows_chatglm3-6b
```
## Installing PyTorch

Run `nvidia-smi` in cmd to check the highest CUDA version your driver supports; mine is 12.2.

PyTorch is an open-source Python machine-learning library based on Torch, used for applications such as natural language processing. You can think of PyTorch either as NumPy with GPU support or as a powerful deep neural network framework with automatic differentiation. Look up the exact command for your machine at https://pytorch.org/get-started/locally/, then run it inside the conda virtual environment to install PyTorch:

```bash
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
```
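To confirm that a CUDA-enabled build was installed, a quick check inside the same conda environment might look like this (the printed values will reflect whatever you installed):

```python
import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version the build targets, e.g. "12.1"
print(torch.cuda.is_available())  # should print True if the GPU driver is picked up
```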
## Setting up the ChatGLM3 environment (non-Ollama mode)

### Downloading the models and the demo

Two models need to be downloaded: chatglm3 and m3e.

#### chatglm3

Model page: https://modelscope.cn/models/ZhipuAI/chatglm3-6b/summary. Download method: git (the download takes quite a while, be patient).
```bash
git lfs install
git clone https://www.modelscope.cn/ZhipuAI/chatglm3-6b.git
```

#### m3e-base

```bash
git clone https://www.modelscope.cn/Jerry0/m3e-base.git
```

#### Official ChatGLM3 demo

Repository: https://github.com/THUDM/ChatGLM3. Download method: git.
```bash
git clone https://github.com/THUDM/ChatGLM3
```

### Configuration and running

Go into the `ChatGLM3/openai_api_demo` folder you just cloned and open `api_server.py`. Scroll to the bottom and replace the code inside the `if __name__ == "__main__":` block with the following. A few values need to be adapted: the `tokenizer` and `model` paths point to the chatglm3 download directory, the `embedding_model` path points to the m3e download directory, and `port` can be set to whatever you need.

```python
if __name__ == "__main__":
    # Load LLM: point these paths at your chatglm3 download directory
    tokenizer = AutoTokenizer.from_pretrained(
        r"E:\Work\HaoQue\FastGPT\models\chatglm3-6B-32k-int4", trust_remote_code=True
    )
    model = AutoModel.from_pretrained(
        r"E:\Work\HaoQue\FastGPT\models\chatglm3-6B-32k-int4",
        trust_remote_code=True,
        device_map="auto",
    ).eval()

    # Load the embedding model: point this at your m3e-base download directory
    embedding_model = SentenceTransformer(
        r"E:\Work\HaoQue\FastGPT\models\m3e-base", trust_remote_code=True, device="cuda"
    )

    uvicorn.run(app, host="0.0.0.0", port=8000, workers=1)
```

Back in the ChatGLM3 root directory, with the `windows_chatglm3-6b` conda environment activated, run the following in cmd and wait for the dependencies to finish installing:

```bash
pip install -r requirements.txt
```

Then start the server:

```bash
python openai_api_demo/api_server.py
```

Check the output; if the server starts without errors, the deployment succeeded.
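Once the server is up, a quick way to confirm the OpenAI-compatible endpoint works is to call it with the `openai` Python client (v1.x). A minimal sketch, assuming port 8000 as configured above; the API key can be any string, since the demo server does not validate it:

```python
from openai import OpenAI

# Point the client at the local ChatGLM3 server started by api_server.py.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-anything")

resp = client.chat.completions.create(
    model="chatglm3-6b",  # the model id exposed by the demo's /v1/models endpoint
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
    temperature=0.8,
)
print(resp.choices[0].message.content)
```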
## Setting up the Ollama environment

### Installing Ollama

Download: https://ollama.com/download. The installer is a simple click-through. When it finishes, open cmd and run `ollama -v` to verify the installation.

### Downloading a model through Ollama

The model list is at https://ollama.com/library. Using qwen:1.8b as an example, run the following in cmd and wait for the download to finish; once downloaded, the model starts automatically and nothing else is needed.

```bash
ollama run qwen:1.8b
```
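To confirm the model is actually being served, you can hit Ollama's local REST API (default port 11434). A minimal sketch with `requests`:

```python
import requests

# Ask the locally served qwen:1.8b model for a non-streaming completion.
r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen:1.8b", "prompt": "Say hello in one short sentence.", "stream": False},
    timeout=120,
)
r.raise_for_status()
print(r.json()["response"])
```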
## Setting up the m3e environment (Ollama mode)

Deployment uses Docker; installing Docker itself is not covered here.

```bash
docker run -d --name m3e -p 6008:6008 --gpus all -e sk-key=123321 registry.cn-hangzhou.aliyuncs.com/fastgpt_docker/m3e-large-api
```
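A direct call to the embeddings endpoint is a quick way to check the container. This is a sketch under two assumptions: that the image exposes an OpenAI-style `/v1/embeddings` route on port 6008, and that the value passed via `-e sk-key=...` is what it expects as the Bearer token; adjust both if your image behaves differently.

```python
import requests

resp = requests.post(
    "http://localhost:6008/v1/embeddings",
    headers={"Authorization": "Bearer 123321"},  # assumed to match the sk-key set above
    json={"model": "m3e", "input": ["FastGPT knowledge-base test sentence"]},
    timeout=60,
)
resp.raise_for_status()
vector = resp.json()["data"][0]["embedding"]
print(len(vector))  # dimensionality of the returned embedding
```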
## Deploying and configuring one-api

Deployment uses Docker; installing Docker itself is not covered here.

```bash
docker run --name one-api -d --restart always -p 3000:3000 -e TZ=Asia/Shanghai -v /home/ubuntu/data/one-api:/data justsong/one-api
```

Then open [http://localhost:3000/](http://localhost:3000/) (the port is whatever you mapped with `-p` in the `docker run` command) and log in; the initial account is username `root`, password `123456`. After logging in, go to the Channels page and add a channel for your large language model (if the model runs through Ollama, it likewise needs its own new channel). Next add a channel for m3e. Finally, create a new token and copy its `sk-...` key for later use.
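With the channel and token in place, you can verify that one-api proxies requests correctly by calling its OpenAI-compatible endpoint with the token you just copied. A sketch; the model name is an example and must match one of the models configured in your channel:

```python
from openai import OpenAI

# one-api exposes an OpenAI-compatible API on the port mapped above (3000 here).
client = OpenAI(base_url="http://localhost:3000/v1", api_key="sk-paste-your-one-api-token-here")

resp = client.chat.completions.create(
    model="qwen:1.8b",  # must match a model name configured in the one-api channel
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```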
## Setting up FastGPT

Reference: https://doc.fastai.site/docs/development/docker/

In a non-Linux environment, or one without outbound internet access, manually create a directory and download the following two files into it: `docker-compose.yml` and `config.json`.

Note: the `docker-compose.yml` uses Mongo 5.x, which some servers do not support; change its image version to 4.4.24 by hand if needed (you have to pull that image from Docker Hub yourself, the Aliyun registry does not mirror it).

### Modifying config.json

Open the downloaded `config.json` and replace the first entry of the **llmModels** array with the data below, changing the `model` and `name` fields to match the model you deployed; everything else can stay unchanged.

```json
{
  "model": "gemma:2b",
  "name": "gemma:2b",
  "maxContext": 16000,
  "avatar": "/imgs/model/openai.svg",
  "maxResponse": 4000,
  "quoteMaxToken": 13000,
  "maxTemperature": 1.2,
  "charsPointsPrice": 0,
  "censor": false,
  "vision": false,
  "datasetProcess": true,
  "usedInClassify": true,
  "usedInExtractFields": true,
  "usedInToolCall": true,
  "usedInQueryExtension": true,
  "toolChoice": true,
  "functionCall": true,
  "customCQPrompt": "",
  "customExtractPrompt": "",
  "defaultSystemChatPrompt": "",
  "defaultConfig": {}
}
```

If your model is deployed with Ollama, also add the following entry to the **vectorModels** array in `config.json`:

```json
{
  "model": "m3e",
  "name": "M3E",
  "inputPrice": 0,
  "outputPrice": 0,
  "defaultToken": 700,
  "maxToken": 1800,
  "weight": 100
}
```

Then open `docker-compose.yml` and comment out the mysql and oneapi related sections. (A small Python helper for these `config.json` edits is sketched below.)
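The same two `config.json` edits can also be scripted. A minimal sketch, assuming the file sits in the current directory and that `qwen:1.8b` is the model name registered in one-api (replace it with yours):

```python
import json
from pathlib import Path

cfg_path = Path("config.json")
cfg = json.loads(cfg_path.read_text(encoding="utf-8"))

# Point the first llmModels entry at the model served through one-api.
cfg["llmModels"][0]["model"] = "qwen:1.8b"
cfg["llmModels"][0]["name"] = "qwen:1.8b"

# Register the m3e vector model if it is not already present.
if not any(m.get("model") == "m3e" for m in cfg["vectorModels"]):
    cfg["vectorModels"].append({
        "model": "m3e", "name": "M3E", "inputPrice": 0, "outputPrice": 0,
        "defaultToken": 700, "maxToken": 1800, "weight": 100,
    })

cfg_path.write_text(json.dumps(cfg, ensure_ascii=False, indent=2), encoding="utf-8")
```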
### Start the containers

Run the commands below in the same directory as `docker-compose.yml`. Make sure your docker-compose version is 2.17 or newer, otherwise the automated commands may not run.
```bash
# Start the containers
docker-compose up -d
# Wait 10 s: on first start, OneAPI usually needs a few restarts before it can connect to MySQL
sleep 10
# Restart oneapi once (OneAPI's default key has an issue: without a restart it reports that the channel cannot be found; this manual restart is a temporary workaround pending a fix from the author)
docker restart oneapi
```

### Accessing FastGPT
FastGPT can currently be reached directly at `ip:3000` (mind the firewall). The login username is `root`, and the password is whatever `DEFAULT_ROOT_PSW` is set to in the `docker-compose.yml` environment variables. If you need domain-name access, install and configure Nginx yourself. On first run the root user is initialized automatically with password `1234`.
Next, inside FastGPT: create a knowledge base, upload your knowledge-base files, create an AI application, and start using it.

## Runtime errors
huggingface-hub errors can be fixed by pinning the package version:

```bash
pip install huggingface-hub==0.20.3
```
If you run out of GPU memory, you can try setting the environment variable below and running `python api_server.py` again (in my tests it did not help):

```bash
set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 108.00 MiB. GPU 0 has a total capacity of 6.00 GiB of which 0 bytes is free. Of the allocated memory 12.31 GiB is allocated by PyTorch, and 1.27 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
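Since the environment variable did not help on a 6 GB card, the workaround that actually fits the model into memory is loading ChatGLM3 with 4-bit quantization, which is what the `api_server.py` listing below ends up doing. A minimal sketch; the path is a placeholder for your local chatglm3-6b directory, and `.quantize(4)` is provided by the ChatGLM3 modeling code loaded via `trust_remote_code`:

```python
from transformers import AutoModel, AutoTokenizer

MODEL_PATH = r"path\to\chatglm3-6b"  # placeholder: point this at your local model directory

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = (
    AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True)
    .quantize(4)   # 4-bit weight quantization to cut VRAM usage substantially
    .cuda()
    .eval()
)
```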
## api_server.py in full

For reference, here is the complete modified `api_server.py`:

```python
"""
This script implements an API for the ChatGLM3-6B model,
formatted similarly to OpenAI's API (https://platform.openai.com/docs/api-reference/chat).
It's designed to be run as a web server using FastAPI and uvicorn,
making the ChatGLM3-6B model accessible through OpenAI Client.

Key Components and Features:
- Model and Tokenizer Setup: Configures the model and tokenizer paths and loads them.
- FastAPI Configuration: Sets up a FastAPI application with CORS middleware for handling cross-origin requests.
- API Endpoints:
  - /v1/models: Lists the available models, specifically ChatGLM3-6B.
  - /v1/chat/completions: Processes chat completion requests with options for streaming and regular responses.
  - /v1/embeddings: Processes embedding requests for a list of text inputs.
- Token Limit Caution: In the OpenAI API, max_tokens is equivalent to HuggingFace's max_new_tokens, not max_length.
  For instance, setting max_tokens to 8192 for a 6b model would result in an error due to the model's inability to
  output that many tokens after accounting for the history and prompt tokens.
- Stream Handling and Custom Functions: Manages streaming responses and custom function calls within chat responses.
- Pydantic Models: Defines structured models for requests and responses, enhancing API documentation and type safety.
- Main Execution: Initializes the model and tokenizer, and starts the FastAPI app on the designated host and port.

Note:
    This script doesn't include the setup for special tokens or multi-GPU support by default.
    Users need to configure their special tokens and can enable multi-GPU support as per the provided instructions.
    Embedding models are only supported on a single GPU.

    Running this script requires 14-15 GB of GPU memory: 2 GB for the embedding model and 12-13 GB for the FP16 ChatGLM3 LLM.
"""

import os
import time
import tiktoken
import torch
import uvicorn

from fastapi import FastAPI, HTTPException, Response
from fastapi.middleware.cors import CORSMiddleware

from contextlib import asynccontextmanager
from typing import List, Literal, Optional, Union
from loguru import logger
from pydantic import BaseModel, Field
from transformers import AutoTokenizer, AutoModel
from utils import process_response, generate_chatglm3, generate_stream_chatglm3
from sentence_transformers import SentenceTransformer

from sse_starlette.sse import EventSourceResponse

# Set up limit request time
EventSourceResponse.DEFAULT_PING_INTERVAL = 1000000

# set LLM path
MODEL_PATH = os.environ.get('MODEL_PATH', r'D:\WangMing\FastGPT\models\chatglm3-6b-copy')
TOKENIZER_PATH = os.environ.get('TOKENIZER_PATH', r'D:\WangMing\FastGPT\models\chatglm3-6b-copy')

# set Embedding Model path
EMBEDDING_PATH = os.environ.get('EMBEDDING_PATH', r'D:\WangMing\FastGPT\models\m3e-base')


@asynccontextmanager
async def lifespan(app: FastAPI):
    yield
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()


app = FastAPI(lifespan=lifespan)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


class ModelCard(BaseModel):
    id: str
    object: str = "model"
    created: int = Field(default_factory=lambda: int(time.time()))
    owned_by: str = "owner"
    root: Optional[str] = None
    parent: Optional[str] = None
    permission: Optional[list] = None


class ModelList(BaseModel):
    object: str = "list"
    data: List[ModelCard] = []


class FunctionCallResponse(BaseModel):
    name: Optional[str] = None
    arguments: Optional[str] = None


class ChatMessage(BaseModel):
    role: Literal["user", "assistant", "system", "function"]
    content: str = None
    name: Optional[str] = None
    function_call: Optional[FunctionCallResponse] = None


class DeltaMessage(BaseModel):
    role: Optional[Literal["user", "assistant", "system"]] = None
    content: Optional[str] = None
    function_call: Optional[FunctionCallResponse] = None


# for Embedding
class EmbeddingRequest(BaseModel):
    input: List[str]
    model: str


class CompletionUsage(BaseModel):
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int


class EmbeddingResponse(BaseModel):
    data: list
    model: str
    object: str
    usage: CompletionUsage


# for ChatCompletionRequest
class UsageInfo(BaseModel):
    prompt_tokens: int = 0
    total_tokens: int = 0
    completion_tokens: Optional[int] = 0


class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    temperature: Optional[float] = 0.8
    top_p: Optional[float] = 0.8
    max_tokens: Optional[int] = None
    stream: Optional[bool] = False
    tools: Optional[Union[dict, List[dict]]] = None
    repetition_penalty: Optional[float] = 1.1


class ChatCompletionResponseChoice(BaseModel):
    index: int
    message: ChatMessage
    finish_reason: Literal["stop", "length", "function_call"]


class ChatCompletionResponseStreamChoice(BaseModel):
    delta: DeltaMessage
    finish_reason: Optional[Literal["stop", "length", "function_call"]]
    index: int


class ChatCompletionResponse(BaseModel):
    model: str
    id: str
    object: Literal["chat.completion", "chat.completion.chunk"]
    choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]
    created: Optional[int] = Field(default_factory=lambda: int(time.time()))
    usage: Optional[UsageInfo] = None


@app.get("/health")
async def health() -> Response:
    """Health check."""
    return Response(status_code=200)


@app.post("/v1/embeddings", response_model=EmbeddingResponse)
async def get_embeddings(request: EmbeddingRequest):
    embeddings = [embedding_model.encode(text) for text in request.input]
    embeddings = [embedding.tolist() for embedding in embeddings]

    def num_tokens_from_string(string: str) -> int:
        """Returns the number of tokens in a text string, using the cl100k_base tokenizer."""
        encoding = tiktoken.get_encoding("cl100k_base")
        num_tokens = len(encoding.encode(string))
        return num_tokens

    response = {
        "data": [
            {
                "object": "embedding",
                "embedding": embedding,
                "index": index
            }
            for index, embedding in enumerate(embeddings)
        ],
        "model": request.model,
        "object": "list",
        "usage": CompletionUsage(
            prompt_tokens=sum(len(text.split()) for text in request.input),
            completion_tokens=0,
            total_tokens=sum(num_tokens_from_string(text) for text in request.input),
        )
    }
    return response


@app.get("/v1/models", response_model=ModelList)
async def list_models():
    model_card = ModelCard(id="chatglm3-6b")
    return ModelList(data=[model_card])


@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
async def create_chat_completion(request: ChatCompletionRequest):
    global model, tokenizer

    if len(request.messages) < 1 or request.messages[-1].role == "assistant":
        raise HTTPException(status_code=400, detail="Invalid request")

    gen_params = dict(
        messages=request.messages,
        temperature=request.temperature,
        top_p=request.top_p,
        max_tokens=request.max_tokens or 1024,
        echo=False,
        stream=request.stream,
        repetition_penalty=request.repetition_penalty,
        tools=request.tools,
    )
    logger.debug(f"==== request ====\n{gen_params}")

    if request.stream:
        # Use the stream mode to read the first few characters; if it is not a function call, stream the output directly
        predict_stream_generator = predict_stream(request.model, gen_params)
        output = next(predict_stream_generator)
        if not contains_custom_function(output):
            return EventSourceResponse(predict_stream_generator, media_type="text/event-stream")

        # Obtain the result directly at one time and determine whether tools need to be called
        logger.debug(f"First result output:\n{output}")

        function_call = None
        if output and request.tools:
            try:
                function_call = process_response(output, use_tool=True)
            except:
                logger.warning("Failed to parse tool call")

        # CallFunction
        if isinstance(function_call, dict):
            function_call = FunctionCallResponse(**function_call)

            # In this demo, no tools are registered.
            # You can use the tools implemented in tools_using_demo and add your own streaming tool call here,
            # similar to:
            #   function_args = json.loads(function_call.arguments)
            #   tool_response = dispatch_tool(tool_name, tool_params)
            tool_response = ""

            if not gen_params.get("messages"):
                gen_params["messages"] = []

            gen_params["messages"].append(ChatMessage(
                role="assistant",
                content=output,
            ))
            gen_params["messages"].append(ChatMessage(
                role="function",
                name=function_call.name,
                content=tool_response,
            ))

            # Streaming output of results after function calls
            generate = predict(request.model, gen_params)
            return EventSourceResponse(generate, media_type="text/event-stream")

        else:
            # Handled to avoid exceptions in the parsing process above
            generate = parse_output_text(request.model, output)
            return EventSourceResponse(generate, media_type="text/event-stream")

    # Here is the handling of stream = False
    response = generate_chatglm3(model, tokenizer, gen_params)

    # Remove the first newline character
    if response["text"].startswith("\n"):
        response["text"] = response["text"][1:]
    response["text"] = response["text"].strip()

    usage = UsageInfo()
    function_call, finish_reason = None, "stop"
    if request.tools:
        try:
            function_call = process_response(response["text"], use_tool=True)
        except:
            logger.warning("Failed to parse tool call, maybe the response is not a tool call or has been answered.")

    if isinstance(function_call, dict):
        finish_reason = "function_call"
        function_call = FunctionCallResponse(**function_call)

    message = ChatMessage(
        role="assistant",
        content=response["text"],
        function_call=function_call if isinstance(function_call, FunctionCallResponse) else None,
    )

    logger.debug(f"==== message ====\n{message}")

    choice_data = ChatCompletionResponseChoice(
        index=0,
        message=message,
        finish_reason=finish_reason,
    )
    task_usage = UsageInfo.model_validate(response["usage"])
    for usage_key, usage_value in task_usage.model_dump().items():
        setattr(usage, usage_key, getattr(usage, usage_key) + usage_value)

    return ChatCompletionResponse(
        model=request.model,
        id="",  # for open_source model, id is empty
        choices=[choice_data],
        object="chat.completion",
        usage=usage
    )


async def predict(model_id: str, params: dict):
    global model, tokenizer

    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=DeltaMessage(role="assistant"),
        finish_reason=None
    )
    chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
    yield "{}".format(chunk.model_dump_json(exclude_unset=True))

    previous_text = ""
    for new_response in generate_stream_chatglm3(model, tokenizer, params):
        decoded_unicode = new_response["text"]
        delta_text = decoded_unicode[len(previous_text):]
        previous_text = decoded_unicode

        finish_reason = new_response["finish_reason"]
        if len(delta_text) == 0 and finish_reason != "function_call":
            continue

        function_call = None
        if finish_reason == "function_call":
            try:
                function_call = process_response(decoded_unicode, use_tool=True)
            except:
                logger.warning("Failed to parse tool call, maybe the response is not a tool call or has been answered.")

        if isinstance(function_call, dict):
            function_call = FunctionCallResponse(**function_call)

        delta = DeltaMessage(
            content=delta_text,
            role="assistant",
            function_call=function_call if isinstance(function_call, FunctionCallResponse) else None,
        )

        choice_data = ChatCompletionResponseStreamChoice(
            index=0,
            delta=delta,
            finish_reason=finish_reason
        )
        chunk = ChatCompletionResponse(
            model=model_id,
            id="",
            choices=[choice_data],
            object="chat.completion.chunk"
        )
        yield "{}".format(chunk.model_dump_json(exclude_unset=True))

    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=DeltaMessage(),
        finish_reason="stop"
    )
    chunk = ChatCompletionResponse(
        model=model_id,
        id="",
        choices=[choice_data],
        object="chat.completion.chunk"
    )
    yield "{}".format(chunk.model_dump_json(exclude_unset=True))
    yield '[DONE]'


def predict_stream(model_id, gen_params):
    """
    The function call is compatible with stream mode output.

    The first seven characters are inspected.
    If they are not a function call, the stream output is generated directly.
    Otherwise, the complete text content of the function call is returned.

    :param model_id:
    :param gen_params:
    :return:
    """
    output = ""
    is_function_call = False
    has_send_first_chunk = False
    for new_response in generate_stream_chatglm3(model, tokenizer, gen_params):
        decoded_unicode = new_response["text"]
        delta_text = decoded_unicode[len(output):]
        output = decoded_unicode

        # When it is not a function call and the text length is > 7,
        # judge whether it is a function call according to the special function prefix
        if not is_function_call and len(output) > 7:

            # Determine whether a function is called
            is_function_call = contains_custom_function(output)
            if is_function_call:
                continue

            # Non-function call, direct stream output
            finish_reason = new_response["finish_reason"]

            # Send an empty string first to avoid truncation by subsequent next() operations
            if not has_send_first_chunk:
                message = DeltaMessage(
                    content="",
                    role="assistant",
                    function_call=None,
                )
                choice_data = ChatCompletionResponseStreamChoice(
                    index=0,
                    delta=message,
                    finish_reason=finish_reason
                )
                chunk = ChatCompletionResponse(
                    model=model_id,
                    id="",
                    choices=[choice_data],
                    created=int(time.time()),
                    object="chat.completion.chunk"
                )
                yield "{}".format(chunk.model_dump_json(exclude_unset=True))

            send_msg = delta_text if has_send_first_chunk else output
            has_send_first_chunk = True
            message = DeltaMessage(
                content=send_msg,
                role="assistant",
                function_call=None,
            )
            choice_data = ChatCompletionResponseStreamChoice(
                index=0,
                delta=message,
                finish_reason=finish_reason
            )
            chunk = ChatCompletionResponse(
                model=model_id,
                id="",
                choices=[choice_data],
                created=int(time.time()),
                object="chat.completion.chunk"
            )
            yield "{}".format(chunk.model_dump_json(exclude_unset=True))

    if is_function_call:
        yield output
    else:
        yield '[DONE]'


async def parse_output_text(model_id: str, value: str):
    """
    Directly output the text content of value.

    :param model_id:
    :param value:
    :return:
    """
    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=DeltaMessage(role="assistant", content=value),
        finish_reason=None
    )
    chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
    yield "{}".format(chunk.model_dump_json(exclude_unset=True))

    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=DeltaMessage(),
        finish_reason="stop"
    )
    chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
    yield "{}".format(chunk.model_dump_json(exclude_unset=True))
    yield '[DONE]'


def contains_custom_function(value: str) -> bool:
    """
    Determine whether this is a function call according to a special function-name prefix.

    For example, the functions defined in "tools_using_demo/tool_register.py" are all "get_xxx"
    and start with "get_".

    [Note] This is not a rigorous check, for reference only.

    :param value:
    :return:
    """
    return value and 'get_' in value


if __name__ == "__main__":
    # Load LLM (quantized to 4 bit so it fits in limited VRAM)
    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH, trust_remote_code=True)
    model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True, device_map="auto").quantize(4).eval()

    # Load the embedding model (m3e-base)
    embedding_model = SentenceTransformer(EMBEDDING_PATH, trust_remote_code=True, device="cuda")
    uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
```