
news 2025/11/14 18:27:17

No existing model can learn online, which limits what it can do. Giving a large model a library of tools turns it into an interactive agent that coordinates API calls to complete tasks. GPT-4 has also been connected to the internet, so it can keep updating its knowledge base. It is a multimodal model that accepts text and image inputs. Because the GPT-4 paper reveals few technical details and shows little about safety, these notes skip a close reading of the paper and instead go through the report on GPT-4 published on OpenAI's website: GPT-4 (openai.com).

We’ve created GPT-4, the latest milestone in OpenAI’s effort in scaling up deep learning. GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks. For example, it passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT-3.5’s score was around the bottom 10%. We’ve spent 6 months iteratively aligning GPT-4 using lessons from our adversarial testing program as well as ChatGPT, resulting in our best-ever results (though far from perfect) on factuality, steerability, and refusing to go outside of guardrails.

Summary: the results are strong, and a great deal of safety testing has been done since August of last year.

Over the past two years, we rebuilt our entire deep learning stack and, together with Azure, co-designed a supercomputer from the ground up for our workload. A year ago, we trained GPT-3.5 as a first “test run” of the system. We found and fixed some bugs and improved our theoretical foundations. As a result, our GPT-4 training run was (for us at least!) unprecedentedly stable, becoming our first large model whose training performance we were able to accurately predict ahead of time.
As we continue to focus on reliable scaling, we aim to hone our methodology to help us predict and prepare for future capabilities increasingly far in advance—something we view as critical for safety.

Summary: OpenAI could accurately predict the outcome of the GPT-4 training run, so there was no need to wait for the full run to finish to learn whether a set of hyperparameters was useful or whether an idea worked. Results of ablation experiments on small models can be extrapolated accurately to the large model without being thrown off by emergent behavior.

Capabilities

In a casual conversation, the distinction between GPT-3.5 and GPT-4 can be subtle. The difference comes out when the complexity of the task reaches a sufficient threshold—GPT-4 is more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5. To understand the difference between the two models, we tested on a variety of benchmarks, including simulating exams that were originally designed for humans. We proceeded by using the most recent publicly-available tests (in the case of the Olympiads and AP free response questions) or by purchasing 2022–2023 editions of practice exams. We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training, but we believe the results to be representative—see our technical report for details.

Summary: OpenAI picked a set of benchmarks to compare GPT-3.5 and GPT-4. In the accompanying chart, the dark green bars show GPT-4 with image input added, which improves the scores. The GPT series is still weak at math.

We also evaluated GPT-4 on traditional benchmarks designed for machine learning models.
GPT-4 considerably outperforms existing large language models, alongside most state-of-the-art (SOTA) models which may include benchmark-specific crafting or additional training protocols:

Summary: on earlier text benchmarks, GPT-4 beats the language-model SOTA across the board.

Many existing ML benchmarks are written in English. To get an initial sense of capability in other languages, we translated the MMLU benchmark—a suite of 14,000 multiple-choice problems spanning 57 subjects—into a variety of languages using Azure Translate (see Appendix). In the 24 of 26 languages tested, GPT-4 outperforms the English-language performance of GPT-3.5 and other LLMs (Chinchilla, PaLM), including for low-resource languages such as Latvian, Welsh, and Swahili:

Visual inputs

GPT-4 can accept a prompt of text and images, which—parallel to the text-only setting—lets the user specify any vision or language task. Specifically, it generates text outputs (natural language, code, etc.) given inputs consisting of interspersed text and images. Over a range of domains—including documents with text and photographs, diagrams, or screenshots—GPT-4 exhibits similar capabilities as it does on text-only inputs. Furthermore, it can be augmented with test-time techniques that were developed for text-only language models, including few-shot and chain-of-thought prompting. Image inputs are still a research preview and not publicly available.
Summary: the visual-input examples include asking GPT-4 to explain what is funny about an image; running OCR on an image, translating the French text into English, and then solving the problem; and having GPT-4 read a paper and summarize it.

We preview GPT-4’s performance by evaluating it on a narrow suite of standard academic vision benchmarks. However, these numbers do not fully represent the extent of its capabilities as we are constantly discovering new and exciting tasks that the model is able to tackle. We plan to release further analyses and evaluation numbers as well as thorough investigation of the effect of test-time techniques soon.

Summary: a comparison on vision and multimodal benchmarks.

Steerability

We’ve been working on each aspect of the plan outlined in our post about defining the behavior of AIs, including steerability. Rather than the classic ChatGPT personality with a fixed verbosity, tone, and style, developers (and soon ChatGPT users) can now prescribe their AI’s style and task by describing those directions in the “system” message. System messages allow API users to significantly customize their users’ experience within bounds. We will keep making improvements here (and particularly know that system messages are the easiest way to “jailbreak” the current model, i.e., the adherence to the bounds is not perfect), but we encourage you to try it out and let us know what you think.

Summary: the system message lets you define the tone in which the AI talks to you.

Limitations

Despite its capabilities, GPT-4 has similar limitations as earlier GPT models. Most importantly, it still is not fully reliable (it “hallucinates” facts and makes reasoning errors).
Great care should be taken when using language model outputs, particularly in high-stakes contexts, with the exact protocol (such as human review, grounding with additional context, or avoiding high-stakes uses altogether) matching the needs of a specific use-case. While still a real issue, GPT-4 significantly reduces hallucinations relative to previous models (which have themselves been improving with each iteration). GPT-4 scores 40% higher than our latest GPT-3.5 on our internal adversarial factuality evaluations:

Summary: hallucinations.

The model can have various biases in its outputs—we have made progress on these but there’s still more to do. Per our recent blog post, we aim to make AI systems we build have reasonable default behaviors that reflect a wide swathe of users’ values, allow those systems to be customized within broad bounds, and get public input on what those bounds should be.

GPT-4 generally lacks knowledge of events that have occurred after the vast majority of its data cuts off (September 2021), and does not learn from its experience. It can sometimes make simple reasoning errors which do not seem to comport with competence across so many domains, or be overly gullible in accepting obvious false statements from a user. And sometimes it can fail at hard problems the same way humans do, such as introducing security vulnerabilities into code it produces.

GPT-4 can also be confidently wrong in its predictions, not taking care to double-check work when it’s likely to make a mistake. Interestingly, the base pre-trained model is highly calibrated (its predicted confidence in an answer generally matches the probability of being correct). However, through our current post-training process, the calibration is reduced.
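Calibration, as described above, can be checked by grouping predictions by their stated confidence and comparing each group's average confidence with its accuracy. The sketch below computes an expected calibration error (ECE) with equal-width bins. The function name, the bin count, and the toy data are illustrative assumptions; this is not necessarily the evaluation OpenAI used.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # bins[i] collects (confidence, correct) pairs whose confidence falls in bin i.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece


# Toy data (made up for illustration).
# Says 0.8 and is right 4 times out of 5: confidence matches accuracy.
well_calibrated = expected_calibration_error([0.8] * 5, [True, True, True, True, False])
# Says 0.9 but is right only half the time: confidence exceeds accuracy.
overconfident = expected_calibration_error([0.9] * 4, [True, False, True, False])
```

An ECE near zero corresponds to points lying on the diagonal of a calibration plot; a larger value means stated confidence and actual accuracy have drifted apart.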
Summary: bias; the training data has a cutoff date; the model defers to the user and is too gullible. The base model's confidence in its predictions closely matches the probability of being correct (in the calibration plot, the dashed diagonal represents perfect calibration). After RLHF, calibration gets worse and the model becomes more opinionated.

Risks & mitigations

We’ve been iterating on GPT-4 to make it safer and more aligned from the beginning of training, with efforts including selection and filtering of the pretraining data, evaluations and expert engagement, model safety improvements, and monitoring and enforcement.

GPT-4 poses similar risks as previous models, such as generating harmful advice, buggy code, or inaccurate information. However, the additional capabilities of GPT-4 lead to new risk surfaces. To understand the extent of these risks, we engaged over 50 experts from domains such as AI alignment risks, cybersecurity, biorisk, trust and safety, and international security to adversarially test the model. Their findings specifically enabled us to test model behavior in high-risk areas which require expertise to evaluate. Feedback and data from these experts fed into our mitigations and improvements for the model; for example, we’ve collected additional data to improve GPT-4’s ability to refuse requests on how to synthesize dangerous chemicals.

GPT-4 incorporates an additional safety reward signal during RLHF training to reduce harmful outputs (as defined by our usage guidelines) by training the model to refuse requests for such content. The reward is provided by a GPT-4 zero-shot classifier judging safety boundaries and completion style on safety-related prompts.
To prevent the model from refusing valid requests, we collect a diverse dataset from various sources (e.g., labeled production data, human red-teaming, model-generated prompts) and apply the safety reward signal (with a positive or negative value) on both allowed and disallowed categories.

Our mitigations have significantly improved many of GPT-4’s safety properties compared to GPT-3.5. We’ve decreased the model’s tendency to respond to requests for disallowed content by 82% compared to GPT-3.5, and GPT-4 responds to sensitive requests (e.g., medical advice and self-harm) in accordance with our policies 29% more often.

Summary: (1) data was collected by human experts to improve safety; (2) GPT-4 itself provides a reward signal: a classifier taken from the pre-trained GPT-4 checks whether prompts are sensitive.

Overall, our model-level interventions increase the difficulty of eliciting bad behavior but doing so is still possible. Additionally, there still exist “jailbreaks” to generate content which violate our usage guidelines. As the “risk per token” of AI systems increases, it will become critical to achieve extremely high degrees of reliability in these interventions; for now it’s important to complement these limitations with deployment-time safety techniques like monitoring for abuse.

GPT-4 and successor models have the potential to significantly influence society in both beneficial and harmful ways.
We are collaborating with external researchers to improve how we understand and assess potential impacts, as well as to build evaluations for dangerous capabilities that may emerge in future systems. We will soon share more of our thinking on the potential social and economic impacts of GPT-4 and other AI systems.

Training process

Like previous GPT models, the GPT-4 base model was trained to predict the next word in a document, and was trained using publicly available data (such as internet data) as well as data we’ve licensed. The data is a web-scale corpus of data including correct and incorrect solutions to math problems, weak and strong reasoning, self-contradictory and consistent statements, and representing a great variety of ideologies and ideas.

So when prompted with a question, the base model can respond in a wide variety of ways that might be far from a user’s intent. To align it with the user’s intent within guardrails, we fine-tune the model’s behavior using reinforcement learning with human feedback (RLHF).

Summary: the base model's answers can sometimes be far from what a person wants, so it was fine-tuned with RLHF.

Note that the model’s capabilities seem to come primarily from the pre-training process—RLHF does not improve exam performance (without active effort, it actually degrades it). But steering of the model comes from the post-training process—the base model requires prompt engineering to even know that it should answer the questions.
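To make the last point concrete: a base model only continues text, so a question is usually framed as a document whose most likely continuation is the answer. The sketch below builds such a few-shot prompt as a plain string. The `Q:`/`A:` layout, the function name `build_few_shot_prompt`, and the example pairs are illustrative assumptions, not a format taken from OpenAI's report.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Format solved Q/A pairs followed by an open question for a text-completion model."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    # The prompt ends on "A:" so that the most likely continuation is an answer.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


prompt = build_few_shot_prompt(
    examples=[
        ("What is the capital of France?", "Paris"),
        ("What is 2 + 3?", "5"),
    ],
    question="What is the chemical symbol for water?",
)
print(prompt)
```

A model fine-tuned with RLHF typically does not need this scaffolding, which is one way to read the claim that steering comes from post-training.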
Summary: RLHF does not raise benchmark scores, but it does steer the model toward a style of response that people are more willing to accept.

Predictable scaling

A large focus of the GPT-4 project has been building a deep learning stack that scales predictably. The primary reason is that, for very large training runs like GPT-4, it is not feasible to do extensive model-specific tuning. We developed infrastructure and optimization that have very predictable behavior across multiple scales. To verify this scalability, we accurately predicted in advance GPT-4’s final loss on our internal codebase (not part of the training set) by extrapolating from models trained using the same methodology but using 10,000x less compute:

Summary: a model this large cannot go through large-scale hyperparameter tuning; even with many machines running in parallel, the loss can easily diverge. OpenAI built infrastructure and optimization methods that keep training stable across scales, so the final loss can be predicted before the full training run by extrapolating from much smaller models. The figure shows that GPT-4's final loss does land on the fitted curve.

Some capabilities are still hard to predict. For example, the Inverse Scaling Prize was a competition to find a metric that gets worse as model compute increases, and hindsight neglect was one of the winners. Just like with another recent result, GPT-4 reverses the trend:

Summary: this competition was set up around the time of GPT-3 to show that large models are not better than small models on every task. The task deliberately pairs a decision with a mismatched outcome, which tempts a model to abandon the logic and judge by the result; GPT-4 reverses the trend.

API

gpt-4 has a context length of 8,192 tokens. We are also providing limited access to our 32,768–context (about 50 pages of text) version, gpt-4-32k, which will also be updated automatically over time (current version gpt-4-32k-0314, also supported until June 14). Pricing is $0.06 per 1K prompt tokens and $0.12 per 1K completion tokens. We are still improving model quality for long context and would love feedback on how it performs for your use-case.
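As a worked example of the pricing above, the cost of one request is the prompt tokens times the prompt rate plus the completion tokens times the completion rate. The sketch below hard-codes the two rates quoted in this section and assumes they apply to gpt-4-32k, as the sentence order suggests. The function name and token counts are made up for illustration, and the code only does arithmetic: it does not call any API or count tokens.

```python
from decimal import Decimal

# Rates quoted in the text, in US dollars per 1,000 tokens.
PROMPT_RATE_PER_1K = Decimal("0.06")
COMPLETION_RATE_PER_1K = Decimal("0.12")
# Context length of gpt-4-32k as stated in the text.
MAX_CONTEXT_TOKENS = 32_768


def estimate_cost(prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Return the estimated cost in dollars for a single request."""
    if prompt_tokens < 0 or completion_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # Assumption: prompt and completion share the same context window.
    if prompt_tokens + completion_tokens > MAX_CONTEXT_TOKENS:
        raise ValueError("request does not fit in the context window")
    prompt_cost = Decimal(prompt_tokens) / 1000 * PROMPT_RATE_PER_1K
    completion_cost = Decimal(completion_tokens) / 1000 * COMPLETION_RATE_PER_1K
    return prompt_cost + completion_cost


# Hypothetical request: a 20,000-token document plus a 1,000-token answer.
cost = estimate_cost(prompt_tokens=20_000, completion_tokens=1_000)
print(f"${cost}")
```

`Decimal` is used instead of `float` so that the dollar amounts stay exact; 20 x $0.06 + 1 x $0.12 comes to $1.32 for this hypothetical request.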
We are processing requests for the 8K and 32K engines at different rates based on capacity, so you may receive access to them at different times.

Conclusion

We look forward to GPT-4 becoming a valuable tool in improving people’s lives by powering many applications. There’s still a lot of work to do, and we look forward to improving this model through the collective efforts of the community building on top of, exploring, and contributing to the model.
