免费网站建设的,seo推广名词解释,网页设计公司创业计划书,wordpress如何编辑首页布局1 简介
DeepSeek-Coder在多种编程语言和各种基准测试中取得了开源代码模型中最先进的性能。 为尝试在开发板进行部署#xff0c;首先利用llama.cpp对其进行量化。
2 llama.cpp安装
git clone之后进入文件夹make即可#xff0c;再将依赖补全pip install -r requirements.tx…1 简介
DeepSeek-Coder在多种编程语言和各种基准测试中取得了开源代码模型中最先进的性能。 为尝试在开发板进行部署首先利用llama.cpp对其进行量化。
2 llama.cpp安装
git clone之后进入文件夹make即可再将依赖补全pip install -r requirements.txt
3 量化
按照GitHub上DeepSeek和llama.cpp官方的信息后者对deepseek模型的量化目前的支持进度还不是很完善。 下面记录一下目前量化出现的问题。
3.1 DeepSeek官方tutorial
依照官方md
git clone https://github.com/DOGEwbx/llama.cpp.git
cd llama.cpp
git checkout regex_gpt2_preprocess出现error: pathspec regex_gpt2_preprocess did not match any file(s) known to git # set up the environment according to README
make
python3 -m pip install -r requirements.txt
# generate GGUF model
python convert-hf-to-gguf.py MODEL_PATH --outfile GGUF_PATH --model-name deepseekcoder出现convert-hf-to-gguf.py: error: unrecognized arguments: --model-name deepseekcoder
去掉--model-name参数出现NotImplementedError: Architecture LlamaForCausalLM not supported!解释。 3.2 convert.py转换
参考这个comment和这个comment使用convert.py进行转换。 看起来这个修改已经被合并了浅浅试一下。
python convert.py MODEL_PATH --outfile GGUF_PATH出现错误: Exception: Vocab size mismatch (model has 32256, but ../DeepSeek-Coder/models/deepseek-coder-1.3b-instruct has 32022). Add the --pad-vocab option and try again.
详细的log如下
Loading model file ../DeepSeek-Coder/models/deepseek-coder-1.3b-instruct/model.safetensors
params Params(n_vocab32256, n_embd2048, n_layer24, n_ctx16384, n_ff5504, n_head16, n_head_kv16, n_expertsNone, n_experts_usedNone, f_norm_eps1e-06, rope_scaling_typeRopeScalingType.LINEAR: linear, f_rope_freq_base100000, f_rope_scale4.0, n_orig_ctxNone, rope_finetunedNone, ftypeNone, path_modelPosixPath(../DeepSeek-Coder/models/deepseek-coder-1.3b-instruct))
Found vocab files: {spm: None, bpe: None, hfft: PosixPath(../DeepSeek-Coder/models/deepseek-coder-1.3b-instruct/tokenizer.json)}
Loading vocab file PosixPath(../DeepSeek-Coder/models/deepseek-coder-1.3b-instruct/tokenizer.json), type hfft
fname_tokenizer: ../DeepSeek-Coder/models/deepseek-coder-1.3b-instruct
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Vocab info: HfVocab with 32000 base tokens and 22 added tokens
Special vocab info: SpecialVocab with 0 merges, special tokens {bos: 32013, eos: 32021, pad: 32014}, add special tokens {bos: True, eos: False}
Permuting layer 0
Permuting layer 1
Permuting layer 2
...省略部分
Permuting layer 22
Permuting layer 23
lm_head.weight - output.weight | BF16 | [32256, 2048]
model.embed_tokens.weight - token_embd.weight | BF16 | [32256, 2048]
model.layers.0.input_layernorm.weight - blk.0.attn_norm.weight | BF16 | [2048]
model.layers.0.mlp.down_proj.weight - blk.0.ffn_down.weight | BF16 | [2048, 5504]
model.layers.0.mlp.gate_proj.weight - blk.0.ffn_gate.weight | BF16 | [5504, 2048]
...
model.layers.18.self_attn.v_proj.weight - blk.18.attn_v.weight | BF16 | [2048, 2048]
model.layers.19.input_layernorm.weight - blk.19.attn_norm.weight | BF16 | [2048]
...
model.layers.9.input_layernorm.weight - blk.9.attn_norm.weight | BF16 | [2048]
model.layers.9.mlp.down_proj.weight - blk.9.ffn_down.weight | BF16 | [2048, 5504]
model.layers.9.mlp.gate_proj.weight - blk.9.ffn_gate.weight | BF16 | [5504, 2048]
model.layers.9.mlp.up_proj.weight - blk.9.ffn_up.weight | BF16 | [5504, 2048]
model.layers.9.post_attention_layernorm.weight - blk.9.ffn_norm.weight | BF16 | [2048]
model.layers.9.self_attn.k_proj.weight - blk.9.attn_k.weight | BF16 | [2048, 2048]
model.layers.9.self_attn.o_proj.weight - blk.9.attn_output.weight | BF16 | [2048, 2048]
model.layers.9.self_attn.q_proj.weight - blk.9.attn_q.weight | BF16 | [2048, 2048]
model.layers.9.self_attn.v_proj.weight - blk.9.attn_v.weight | BF16 | [2048, 2048]
model.norm.weight - output_norm.weight | BF16 | [2048]
Writing ../DeepSeek-Coder/models/1.3b.gguf, format 1
Traceback (most recent call last):File /home/stlinpeiyang/lpy22/LLM/llama.cpp/convert.py, line 1479, in modulemain()File /home/stlinpeiyang/lpy22/LLM/llama.cpp/convert.py, line 1473, in mainOutputFile.write_all(outfile, ftype, params, model, vocab, special_vocab,File /home/stlinpeiyang/lpy22/LLM/llama.cpp/convert.py, line 1117, in write_allcheck_vocab_size(params, vocab, pad_vocabpad_vocab)File /home/stlinpeiyang/lpy22/LLM/llama.cpp/convert.py, line 963, in check_vocab_sizeraise Exception(msg)
Exception: Vocab size mismatch (model has 32256, but ../DeepSeek-Coder/models/deepseek-coder-1.3b-instruct has 32022). Add the --pad-vocab option and try again.3.2.1 添加--pad-vocab
首先显然提示添加参数根据提示加上--pad-vocab参数后成功运行并可以成功量化但是在测试时会出现以下错误
terminate called after throwing an instance of std::out_of_rangewhat(): _Map_base::at
Aborted (core dumped)这种情况有相关的issue comment这个。
从llama.cpp的pull request和issue来看应该是还没处理好。菜鸡只能嗷嗷待哺了 。不知道TheBloke大佬是怎么处理的。 表情网站 3.2.2 修改vocab_size
其次根据错误的前半段的model has 32256, but ... has 32022有类似的issue. 根据comment对vocal_size进行修改。相应地打开deepseek-coder-1.3b-instruct中的config.json文件试将vocab_size: 32256修改为vocal_size: 32022。再次运行
python convert.py MODEL_PATH --outfile GGUF_PATH输出的log如下
Loading model file ../DeepSeek-Coder/models/deepseek-coder-1.3b-instruct/model.safetensors
params Params(n_vocab32022, n_embd2048, n_layer24, n_ctx16384, n_ff5504, n_head16, n_head_kv16, n_expertsNone, n_experts_usedNone, f_norm_eps1e-06, rope_scaling_typeRopeScalingType.LINEAR: linear, f_rope_freq_base100000, f_rope_scale4.0, n_orig_ctxNone, rope_finetunedNone, ftypeNone, path_modelPosixPath(../DeepSeek-Coder/models/deepseek-coder-1.3b-instruct))
Found vocab files: {spm: None, bpe: None, hfft: PosixPath(../DeepSeek-Coder/models/deepseek-coder-1.3b-instruct/tokenizer.json)}
Loading vocab file PosixPath(../DeepSeek-Coder/models/deepseek-coder-1.3b-instruct/tokenizer.json), type hfft
fname_tokenizer: ../DeepSeek-Coder/models/deepseek-coder-1.3b-instruct
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Vocab info: HfVocab with 32000 base tokens and 22 added tokens
Special vocab info: SpecialVocab with 0 merges, special tokens {bos: 32013, eos: 32021, pad: 32014}, add special tokens {bos: True, eos: False}
Permuting layer 0
Permuting layer 1
Permuting layer 2
...省略部分
lm_head.weight - output.weight | BF16 | [32256, 2048]
model.embed_tokens.weight - token_embd.weight | BF16 | [32256, 2048]
model.layers.0.input_layernorm.weight - blk.0.attn_norm.weight | BF16 | [2048]
model.layers.0.mlp.down_proj.weight - blk.0.ffn_down.weight | BF16 | [2048, 5504]
model.layers.0.mlp.gate_proj.weight - blk.0.ffn_gate.weight | BF16 | [5504, 2048]
model.layers.0.mlp.up_proj.weight - blk.0.ffn_up.weight | BF16 | [5504, 2048]
model.layers.0.post_attention_layernorm.weight - blk.0.ffn_norm.weight | BF16 | [2048]
model.layers.0.self_attn.k_proj.weight - blk.0.attn_k.weight | BF16 | [2048, 2048]
model.layers.0.self_attn.o_proj.weight - blk.0.attn_output.weight | BF16 | [2048, 2048]
model.layers.0.self_attn.q_proj.weight - blk.0.attn_q.weight | BF16 | [2048, 2048]
model.layers.0.self_attn.v_proj.weight - blk.0.attn_v.weight
...省略部分
model.layers.9.self_attn.q_proj.weight - blk.9.attn_q.weight | BF16 | [2048, 2048]
model.layers.9.self_attn.v_proj.weight - blk.9.attn_v.weight | BF16 | [2048, 2048]
model.norm.weight - output_norm.weight | BF16 | [2048]
Writing ../DeepSeek-Coder/models/1.3b.gguf, format 1
Ignoring added_tokens.json since model matches vocab size without it.
gguf: This GGUF file is for Little Endian only
gguf: Setting special token type bos to 32013
gguf: Setting special token type eos to 32021
gguf: Setting special token type pad to 32014
gguf: Setting add_bos_token to True
gguf: Setting add_eos_token to False
gguf: Setting chat_template to {% if not add_generation_prompt is defined %}
{% set add_generation_prompt false %}
{% endif %}
{%- set ns namespace(foundfalse) -%}
{%- for message in messages -%}{%- if message[role] system -%}{%- set ns.found true -%}{%- endif -%}
{%- endfor -%}
{{bos_token}}{%- if not ns.found -%}
{{You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer\n}}
{%- endif %}
{%- for message in messages %}{%- if message[role] system %}
{{ message[content] }}{%- else %}{%- if message[role] user %}
{{### Instruction:\n message[content] \n}}{%- else %}
{{### Response:\n message[content] \n|EOT|\n}}{%- endif %}{%- endif %}
{%- endfor %}
{% if add_generation_prompt %}
{{### Response:}}
{% endif %}
[ 1/219] Writing tensor output.weight | size 32256 x 2048 | type F16 | T 0
[ 2/219] Writing tensor token_embd.weight | size 32256 x 2048 | type F16 | T 0
...省略部分
[216/219] Writing tensor blk.9.attn_output.weight | size 2048 x 2048 | type F16 | T 2
[217/219] Writing tensor blk.9.attn_q.weight | size 2048 x 2048 | type F16 | T 2
[218/219] Writing tensor blk.9.attn_v.weight | size 2048 x 2048 | type F16 | T 2
[219/219] Writing tensor output_norm.weight | size 2048 | type F32 | T 2
Wrote ../DeepSeek-Coder/models/1.3b.gguf成功生成gguf文件。下一步进行量化
./quantize ${out_model.gguf} ${out_model-q5_0.gguf} q5_0输出log如下
main: build 1 (231ae28)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: quantizing ../DeepSeek-Coder/models/1.3b.gguf to ../DeepSeek-Coder/models/1.3b-q5_0.gguf as Q5_0
llama_model_loader: loaded meta data with 24 key-value pairs and 219 tensors from ../DeepSeek-Coder/models/1.3b.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str llama
llama_model_loader: - kv 1: general.name str models
llama_model_loader: - kv 2: llama.context_length u32 16384
llama_model_loader: - kv 3: llama.embedding_length u32 2048
llama_model_loader: - kv 4: llama.block_count u32 24
llama_model_loader: - kv 5: llama.feed_forward_length u32 5504
llama_model_loader: - kv 6: llama.rope.dimension_count u32 128
llama_model_loader: - kv 7: llama.attention.head_count u32 16
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 16
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 0.000001
llama_model_loader: - kv 10: llama.rope.freq_base f32 100000.000000
llama_model_loader: - kv 11: llama.rope.scaling.type str linear
llama_model_loader: - kv 12: llama.rope.scaling.factor f32 4.000000
llama_model_loader: - kv 13: general.file_type u32 1
llama_model_loader: - kv 14: tokenizer.ggml.model str llama
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32022] [!, \, #, $, %, , , ...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32022] [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32022] [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 32013
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 32021
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 32014
llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool true
llama_model_loader: - kv 22: tokenizer.ggml.add_eos_token bool false
llama_model_loader: - kv 23: tokenizer.chat_template str {% if not add_generation_prompt is de...
llama_model_loader: - type f32: 49 tensors
llama_model_loader: - type f16: 170 tensors
llama_model_quantize_internal: meta size 767616 bytes
[ 1/ 219] output.weight - [ 2048, 32256, 1, 1], type f16, quantizing to q6_K .. size 126.00 MiB - 51.68 MiB
[ 2/ 219] token_embd.weight - [ 2048, 32256, 1, 1], type f16, quantizing to q5_0 .. size 126.00 MiB - 43.31 MiB | hist: 0.040 0.018 0.028 0.043 0.061 0.082 0.101 0.114 0.117 0.109 0.092 0.072 0.052 0.035 0.022 0.016
...
[ 218/ 219] blk.9.attn_v.weight - [ 2048, 2048, 1, 1], type f16, quantizing to q5_0 .. size 8.00 MiB - 2.75 MiB | hist: 0.040 0.017 0.028 0.042 0.060 0.081 0.101 0.116 0.121 0.109 0.091 0.071 0.051 0.034 0.022 0.016
[ 219/ 219] output_norm.weight - [ 2048, 1, 1, 1], type f32, size 0.008 MB
llama_model_quantize_internal: model size 2568.38 MB
llama_model_quantize_internal: quant size 891.50 MB
llama_model_quantize_internal: hist: 0.040 0.017 0.028 0.043 0.061 0.082 0.101 0.114 0.118 0.109 0.092 0.071 0.051 0.035 0.022 0.016main: quantize time 9300.54 ms
main: total time 9300.54 ms进行测试
./main -m ../DeepSeek-Coder/models/1.3b-q5_0.gguf -n 256 -t 18 --repeat_penalty 1.0 --color -i -r User: -f ./prompts/chat-with-bob.txt -ngl 20加载模型失败.
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build 1 (231ae28)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed 1710571501
llama_model_loader: loaded meta data with 25 key-value pairs and 219 tensors from ../DeepSeek-Coder/models/1.3b-q5_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str llama
llama_model_loader: - kv 1: general.name str models
llama_model_loader: - kv 2: llama.context_length u32 16384
llama_model_loader: - kv 3: llama.embedding_length u32 2048
llama_model_loader: - kv 4: llama.block_count u32 24
llama_model_loader: - kv 5: llama.feed_forward_length u32 5504
llama_model_loader: - kv 6: llama.rope.dimension_count u32 128
llama_model_loader: - kv 7: llama.attention.head_count u32 16
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 16
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 0.000001
llama_model_loader: - kv 10: llama.rope.freq_base f32 100000.000000
llama_model_loader: - kv 11: llama.rope.scaling.type str linear
llama_model_loader: - kv 12: llama.rope.scaling.factor f32 4.000000
llama_model_loader: - kv 13: general.file_type u32 8
llama_model_loader: - kv 14: tokenizer.ggml.model str llama
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32022] [!, \, #, $, %, , , ...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32022] [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32022] [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 32013
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 32021
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 32014
llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool true
llama_model_loader: - kv 22: tokenizer.ggml.add_eos_token bool false
llama_model_loader: - kv 23: tokenizer.chat_template str {% if not add_generation_prompt is de...
llama_model_loader: - kv 24: general.quantization_version u32 2
llama_model_loader: - type f32: 49 tensors
llama_model_loader: - type q5_0: 169 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: SPM vocabulary, but newline token not found: _Map_base::at! Using special_pad_id instead.llm_load_vocab: mismatch in special tokens definition ( 9/32022 vs 22/32022 ).
llm_load_print_meta: format GGUF V3 (latest)
llm_load_print_meta: arch llama
llm_load_print_meta: vocab type SPM
llm_load_print_meta: n_vocab 32022
llm_load_print_meta: n_merges 0
llm_load_print_meta: n_ctx_train 16384
llm_load_print_meta: n_embd 2048
llm_load_print_meta: n_head 16
llm_load_print_meta: n_head_kv 16
llm_load_print_meta: n_layer 24
llm_load_print_meta: n_rot 128
llm_load_print_meta: n_embd_head_k 128
llm_load_print_meta: n_embd_head_v 128
llm_load_print_meta: n_gqa 1
llm_load_print_meta: n_embd_k_gqa 2048
llm_load_print_meta: n_embd_v_gqa 2048
llm_load_print_meta: f_norm_eps 0.0e00
llm_load_print_meta: f_norm_rms_eps 1.0e-06
llm_load_print_meta: f_clamp_kqv 0.0e00
llm_load_print_meta: f_max_alibi_bias 0.0e00
llm_load_print_meta: n_ff 5504
llm_load_print_meta: n_expert 0
llm_load_print_meta: n_expert_used 0
llm_load_print_meta: pooling type 0
llm_load_print_meta: rope type 0
llm_load_print_meta: rope scaling linear
llm_load_print_meta: freq_base_train 100000.0
llm_load_print_meta: freq_scale_train 0.25
llm_load_print_meta: n_yarn_orig_ctx 16384
llm_load_print_meta: rope_finetuned unknown
llm_load_print_meta: model type ?B
llm_load_print_meta: model ftype Q5_0
llm_load_print_meta: model params 1.35 B
llm_load_print_meta: model size 891.50 MiB (5.55 BPW)
llm_load_print_meta: general.name models
llm_load_print_meta: BOS token 32013 begin▁of▁sentence
llm_load_print_meta: EOS token 32021 |EOT|
llm_load_print_meta: UNK token 0 !
llm_load_print_meta: PAD token 32014 end▁of▁sentence
llm_load_tensors: ggml ctx size 0.08 MiB
llama_model_load: error loading model: create_tensor: tensor token_embd.weight has wrong shape; expected 2048, 32022, got 2048, 32256, 1, 1
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model ../DeepSeek-Coder/models/1.3b-q5_0.gguf
main: error: unable to load model看错误llama_model_load: error loading model: create_tensor: tensor token_embd.weight has wrong shape; expected 2048, 32022, got 2048, 32256, 1, 1应该是跟前面修改的vocab-size有关。