当前位置：首页 > news >正文

公司网站建设汇报商务网站策划书

news 2025/11/15 4:01:09

公司网站建设汇报,商务网站策划书,图书馆门户网站建设有哪些公司,wordpress slides一、说明我在日常工作中不断遇到的一项挑战是在无法访问黄金标准标签的情况下标记文本数据。这绝不是一项微不足道的任务#xff0c;在本文中#xff0c;我将向您展示一种相对准确地完成此任务的方法#xff0c;同时保持管道的可解释性和易于调整。一些读者可能已经开始考… 一、说明我在日常工作中不断遇到的一项挑战是在无法访问黄金标准标签的情况下标记文本数据。这绝不是一项微不足道的任务在本文中我将向您展示一种相对准确地完成此任务的方法同时保持管道的可解释性和易于调整。一些读者可能已经开始考虑使用基于变压器的零样本模型来完成此任务或LLM但在这里我将提出几个原因来说明为什么这可能会出现问题 ——零样本学习是一个黑匣子。理解提示/类标签选择的含义相当困难而且您很少对所有看似随意的选择会影响什么有很好的直觉。 — Transformer 和 LLM 速度慢且成本高。使用 OpenAI 的 API 需要花费大量金钱而且由于速度相当慢而可能不切实际。您当然可以自行托管一个较小的变压器模型但如果您希望事情变得敏捷且响应迅速这通常是生产中的要求它仍然需要大量计算资源。在这种情况下我想说主题模型是一个非常合理的折衷方案。它们可能不如零样本变压器模型那么智能并且您将必须做更多的体力劳动才能在实践中使用它们但它们可以让您对过程进行更细粒度的控制并给出更可解释的结果更不用说性能优势了。二、工作流程在本文中我将引导您完成创建机器学习管道的工作流程以使用主题模型和良好的旧冷硬算法规则来标记小说文本。 2.1 数据为了证明我的观点我将做一些作弊我将使用带标签的数据集来证明我提出的方法的有效性。不过我只会使用标签进行评估创建管道的整个过程将基于无监督学习和我们自己的人类直觉。该数据集有 20 个新闻组您可以使用 scikit-learn 轻松加载。 pip install scikit-learn from sklearn.datasets import fetch_20newsgroups import numpy as npnewsgroups fetch_20newsgroups(subsetall) corpus newsgroups.data# Sklearn gives the labels back as integers, we have to map them back to # the actual textual label. group_labels [newsgroups.target_names[label] for label in newsgroups.target]print(np.unique(group_labels)) ------------------------------------------------------------------ array([alt.atheism, comp.graphics, comp.os.ms-windows.misc,comp.sys.ibm.pc.hardware, comp.sys.mac.hardware,comp.windows.x, misc.forsale, rec.autos, rec.motorcycles,rec.sport.baseball, rec.sport.hockey, sci.crypt,sci.electronics, sci.med, sci.space,soc.religion.christian, talk.politics.guns,talk.politics.mideast, talk.politics.misc,talk.religion.misc], dtypeU24) 假设我想根据文本是否与空间相关来标记文本以便我们可以将无监督分类性能与实际标签进行比较。我什至会将数据分成训练集和测试集这样我们就可以确保我们不会获知模型正在测试的信息又名主题模型不会在测试中进行训练放。 from sklearn.model_selection import train_test_split is_about_space np.array(group_labels) sci.spaceX_train, X_test, y_train, y_test train_test_split(corpus, is_about_space) 2.2 无监督模型我们可以使用topicwizard创建易于使用的主题管道然后解释主题。 pip install topicwizard 在 topicwizard 中创建主题管道被认为是将矢量化器和分解模型链接在一起。作为矢量化器我们将使用 scikit-learn 内置的 CountVectorizer 并设置一些看起来合理的默认频率截止值并将过滤英语停用词。我们将使用非负矩阵分解作为我们的主题模型因为它非常快训练和推理并且通常工作得相当好。我将确定 30 个主题这是一个完全任意的数字。 from sklearn.decomposition import NMF from sklearn.feature_extraction.text import CountVectorizer from topicwizard.pipeline import make_topic_pipeline# Setting up topic modelling pipeline vectorizer CountVectorizer(max_df0.5, min_df10, stop_wordsenglish) # NMF topic model with 20 topics nmf NMF(n_components30)topic_pipeline make_topic_pipeline(vectorizer, nmf) topic_pipeline.fit(X_train) 2.3 模型解读 topicwizard 附带了许多内置的可视化来解释主题模型。现在我们主要感兴趣的是哪些主题可能包含大部分与空间相关的单词。为此我们将使用 topicwizard 的图形 API。首先我们通过创建一组条形图来查看与每个主题最相关的单词。 from topicwizard.figures import topic_barchartstopic_barcharts(X_train, pipelinetopic_pipeline, top_n5) 不幸的是其中大部分都是垃圾我们可能应该更好地清理数据集但23_satellite_space_launch_nasa看起来相当有前途。让我们看一下主题模型中的单词映射以了解它们彼此之间的位置关系。 from topicwizard.figures import word_mapword_map(X_train, pipelinetopic_pipeline, top_n5) 我们可以看到有一组单词非常空旷而且它们的位置也与某些与技术相关的单词非常接近。我们还可以检查“space”和“astro”这两个词属于哪些主题以及它们最接近的 20 个关联。我们只会显示前 8 个主题。 from topicwizard.figures import word_association_barchartfig word_association_barchart([space, astro],corpusX_train,pipelinetopic_pipeline,n_association20,top_n8 ) 我们可以看到到目前为止最主要的主题是我们已经确定的主题从现在开始我将在我们的分析中只关注这一主题。让我们转换我们的训练语料库并查看该主题的重要性分布以便我们可以选择合理的阈值。首先我将主题模型的输出设置为数据框然后我们可以通过绘图直方图看到分布。 import plotly.express as pxtopic_pipeline.set_output(transformpandas)topic_df topic_pipeline.transform(X_train) px.histogram(topic_df[23_satellite_space_launch_nasa]) 我们可以看到绝大多数文本都在0.1以下。我说我们尝试将阈值设置为 0.05然后查看从中得到的随机文本样本。 topic_df[content] X_train sample topic_df[topic_df[23_satellite_space_launch_nasa] 0.05].content.sample(10) for text in sample:print(text[:200]) From: CPKJPvm.cc.latech.edu (Kevin Parker) Subject: Insurance Rates on Performance Cars SUMMARY Organization: Louisiana Tech University Lines: 244 NNTP-Posting-Host: vm.cc.latech.edu X-Newsreader: NN From: pjseuclid.JPL.NASA.GOV (Peter J. Scott) Subject: Re: Did Microsoft buy Xhibition?? Organization: Jet Propulsion Laboratory, NASA/Caltech Lines: 8 Distribution: world Reply-To: pjseuclid.jpl.na From: mlchiron.astro.uu.se (Mats Lindgren) Subject: Re: Comet in Temporary Orbit Around Jupiter? Organization: Uppsala University Lines: 14 Distribution: world NNTP-Posting-Host: chiron.astro.uu.seFrom: mikegordian.com (Michael A. Thomas) Subject: Re: The Role of the National News Media in Inflaming Passions Organization: Gordian; Costa Mesa, CA Distribution: ca Lines: 13In article 1qjtmjIN From: leechcs.unc.edu (Jon Leech) Subject: Space FAQ 04/15 - Calculations Supersedes: math_730956451cs.unc.edu Organization: University of North Carolina, Chapel Hill Lines: 334 Distribution: worl From: wlscalvin.usc.edu (Bill Scheding) Subject: Re: Full page PB screen Organization: University of Southern California, Los Angeles, CA Lines: 14 Distribution: world NNTP-Posting-Host: calvin.usc From: ghelfviolet.berkeley.edu (;;;;RD48) Subject: Re: Soyuz and Shuttle Comparisons Organization: University of California, Berkeley Lines: 11 NNTP-Posting-Host: violet.berkeley.eduAre you guys ta From: gsh7wfermi.clas.Virginia.EDU (Greg Hennessy) Subject: Re: Keeping Spacecraft on after Funding Cuts. Organization: University of Virginia Lines: 13In article 1r6aqr$dnvaccess.digex.net prb From: oeth6050iscsvax.uni.edu Subject: ****COMIC BOOK SALE**** Organization: University of Northern Iowa Lines: 36Hello,my name is John and I have the following comic books for sale - plea From: shaferrigel.dfrf.nasa.gov (Mary Shafer) Subject: Re: Inner Ear Problems from Too Much Flying? Article-I.D.: rigel.SHAFER.93Apr6095951 Organization: NASA Dryden, Edwards, Cal. Lines: 33 In-Reply Hmm some of these texts do not seem to have much to do with space, let’s set a higher threshold.topic_df[content] X_train sample topic_df[topic_df[23_satellite_space_launch_nasa] 0.15].content.sample(10) for text in sample:print(text[:200]) 嗯其中一些文本似乎与空间没有太大关系让我们设置一个更高的阈值。 topic_df[content] X_train sample topic_df[topic_df[23_satellite_space_launch_nasa] 0.15].content.sample(10) for text in sample:print(text[:200]) rom: genetheporch.raider.net (Gene Wright) Subject: NASA Special Publications for Voyager Mission? Organization: The MacInteresteds of Nashville, Tn. Lines: 12I have two books, both NASA Special P From: 18084TMmsu.edu (Tom) Subject: Billsats X-Added: Forwarded by Space Digest Organization: [via International Space University] Original-Sender: isuVACATION.VENARI.CS.CMU.EDU Distribution: sci Li From: pwwspacsun.rice.edu (Peter Walker) Subject: Re: The Universe and Black Holes, was Re: 2000 years..... Organization: I didnt do it, nobody saw me, you cant prove a thing. Lines: 28In article From: da709cleveland.Freenet.Edu (Stephen Amadei) Subject: Project Help Organization: Case Western Reserve University, Cleveland, Ohio (USA) Lines: 17 NNTP-Posting-Host: hela.ins.cwru.eduHello, From: dbm0000tm0006.lerc.nasa.gov (David B. Mckissock) Subject: Washington Post Article on SSF Redesign News-Software: VAX/VMS VNEWS 1.41 Nntp-Posting-Host: tm0006.lerc.nasa.gov Organization: NAS From: u920496daimi.aau.dk (Hans Erik Martino Hansen) Subject: Commercials on the Moon Organization: DAIMI: Computer Science Department, Aarhus University, Denmark Lines: 16I have often thought abou From: wb8fozskybridge.SCL.CWRU.Edu (David Lesher) Subject: Re: No. Re: Space Marketing would be wonderfull. Organization: NRK Clinic for habitual NetNews abusers - Beltway Annex Lines: 11 Reply-To: w From: 18084TMmsu.edu (Tom) Subject: Solid state vs. tube/analog X-Added: Forwarded by Space Digest Organization: [via International Space University] Original-Sender: isuVACATION.VENARI.CS.CMU.EDU D From: pgfsrl03.cacs.usl.edu (Phil G. Fraering) Subject: Re: Gamma Ray Bursters. positional stuff. Organization: Univ. of Southwestern Louisiana Lines: 24belgarathvax1.mankato.msus.edu writes: From: rnicholscbnewsg.cb.att.com (robert.k.nichols) Subject: Re: Permanaent Swap File with DOS 6.0 dbldisk Summary: PageOverCommitfactor Organization: ATT Lines: 50In article 93059hydra.gatech. 这些似乎确实与空间相关所以让我们保留 0.15 作为阈值。 2.4 分类管道现在我们已经有了如何查看哪些文本与空间相关的规则我们应该将这些知识合并到机器学习管道中以便我们可以在未来的工作或生产中轻松使用。为此我们将使用令人惊叹的人类学习库您可以在其中创建基于规则的组件甚至可以绘制东西这真的很棒。为此我们必须冻结主题模型以便在管道上调用 fit() 时不会对其进行训练 pip install human-learn from hulearn.classification import FunctionClassifier from sklearn.pipeline import make_pipeline# Creating rule for classifying something as a space document def space_rule(df, threshold0.15):return df[23_satellite_space_launch_nasa] threshold# Freezing topic pipeline topic_pipeline.freeze True classifier FunctionClassifier(space_rule) cls_pipeline make_pipeline(topic_pipeline, classifier).fit(X_train) 我们现在有了一个与空间相关的文本的分类器不是很漂亮吗请记住我们甚至还没有触及标签只是使用了主题模型和我们自己的人类直觉。三、评估为了检查这是否确实是一种有效的方法让我们根据测试数据评估我们的分类管道。 from sklearn.metrics import classification_reporty_pred cls_pipeline.predict(X_test) print(classification_report(y_test, y_pred)) precision recall f1-score supportFalse 0.98 0.98 0.98 4475True 0.65 0.70 0.68 237accuracy 0.97 4712macro avg 0.82 0.84 0.83 4712 weighted avg 0.97 0.97 0.97 4712 考虑到我们对标签的查看绝对为零并且数据集非常不平衡这些结果非常好我怀疑我们仍然可以调整这一点并通过更干净的数据、更明智的主题模型选择和潜在的更多主题来获得更好的结果以便我们可以捕获数据中的更多差异。马顿·卡多斯

查看全文

http://www.zqtcl.cn/news/287633/