

news 2025/11/15 1:14:33

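1. How Lucene works

Lucene is an open-source full-text search engine toolkit from the Apache Software Foundation. It is an architecture for full-text search that provides a complete query engine and indexing engine, plus part of a text analysis engine. It is not a complete search application; it gives your own application indexing and search capabilities.

Lucene builds indexes over text data, so as long as you can convert the data you want to index into text, Lucene can index and search it. To index HTML or PDF documents, for example, you first convert them to plain text, hand the converted content to Lucene for indexing, save the resulting index files to disk or memory, and finally run queries against the index based on the conditions the user enters. The relationship between a search application and Lucene also reflects the workflow for building a search application with Lucene.

2. Indexing and searching

The index is the core of a modern search engine. Building an index means processing the source data into index files that are very convenient to query.

Why is the index so important? Suppose you need to find the documents that contain a given keyword in a large collection. Without an index you would have to read the documents into memory one after another and check each one for the keyword, which takes a great deal of time, whereas a search engine returns results within milliseconds. The index is what makes that possible. You can think of it as a data structure that gives you fast random access to the keywords stored in it, and from each keyword to the documents associated with it.

Lucene uses a mechanism called an inverted index. We maintain a table of words or phrases, and for every entry in the table there is a list describing which documents contain that word or phrase. When the user enters a query, the results can therefore be found very quickly.

The search engine first parses the search keywords, then looks them up in the index that has been built, and finally returns the documents associated with the keywords the user entered.

For Chinese users, the main concern is whether Chinese full-text search is supported. Thanks to Lucene's well-designed architecture, supporting Chinese only requires extending its language analysis interface.

3. Indexing steps

Acquiring content

Lucene itself provides no tool or component for acquiring content; developers have to supply the content with their own programs. This step includes using a web crawler or spider to find and delimit the content to be indexed. Data sources may also include databases, distributed file systems, local XML files and so on. As a core search library, Lucene offers no functionality for content acquisition. Many open-source tools can do this job, for example: Solr (a Lucene subproject); Nutch (an Apache project with large-scale crawling tools that fetch and parse website data); Grub (a popular open-source web crawler); Heritrix (an open-source Internet document crawler); and Aperture (which fetches from websites, file systems and mailboxes, then parses and indexes the text).

Building documents

Once the raw content has been acquired, it has to be converted into units called documents before it can be indexed. A document mainly consists of several fields with values, such as title, body, summary, author and link. If a document or field is particularly important, a weight (boost) can be added to it. After the scheme has been designed, the text is extracted from the raw content and written into the documents. A document filter can be used for this step; open-source projects such as Tika do it well. If the raw content lives in a database, some projects connect the content acquisition and document building steps seamlessly so that database tables can easily be indexed and searched, for example DBSight, Hibernate Search, LuSQL, Compass and the Oracle/Lucene integration project.

Analyzing documents

A search engine cannot index text directly. The text has to be split into a series of independent atomic elements called tokens. Each token corresponds roughly to a "word" in the language. This step determines how the text fields of a document are split into a token sequence. Lucene ships with many built-in analyzers that make this step easy to control.

Indexing documents

The document is added to the index. Lucene provides a powerful API for this step; calling a few methods is enough to build the index for a document. Indexing is a step that has to be handled well to give users a good experience, so an indexing program should be designed and customized around improving the user's search experience.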
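To make the inverted index from section 2 and the analysis step from section 3 more concrete, here is a minimal sketch in plain Java. It is not how Lucene is implemented: the class InvertedIndexSketch and its methods tokenize, add and search are hypothetical names made up for this illustration, the tokenizer is a naive lower-case-and-split, and a real Lucene index stores far more (for example term frequencies and positions) in a compressed on-disk format.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy inverted index (hypothetical, not Lucene): maps each term to the ids of the documents containing it. */
final class InvertedIndexSketch {
    // term -> list of document ids in insertion order (the "posting list")
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Naive analysis step: lower-case the text and split on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Indexes one document. Assumes each document is added exactly once. */
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            List<Integer> list = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Skip the id if this term already occurred earlier in the same document.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    /** Returns the ids of the documents containing the term, or an empty list. */
    List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "Lucene is a search library");
        index.add(1, "An inverted index makes search fast");
        index.add(2, "Lucene builds an inverted index");
        System.out.println(index.search("lucene"));  // [0, 2]
        System.out.println(index.search("search"));  // [0, 1]
        System.out.println(index.search("missing")); // []
    }
}
```

A lookup is a single hash-map access followed by reading a short list, which is why a query does not need to scan every document.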
4. Search components

The search component takes a search phrase, tokenizes it, and looks the resulting terms up in the index to find the documents that contain them. Search quality is measured by precision and recall. The search component mainly consists of the following parts.

User search interface

This is the page that interacts with the user, in other words what is visible in the browser. The main consideration here is the page's UI design; a good UI is an important part of attracting users.

Building the query

The user enters the phrase to search for, and it is submitted to the back-end server through an ordinary HTML form or via ajax. The terms are then passed to the back-end search engine. That is the whole of a simple query-building process.

Running the query

The query is run against the index and the documents matching the query terms are returned. The returned results are then sorted as the query request specifies. This component covers most of the complexity inside a search engine.

Presenting results

Presenting results is similar to the search interface in the first part: both are front-end pages that interact with the user. For a search engine, user experience always comes first, and the front-end presentation plays a major role in it.

5. Walking through the official example

Using Lucene comes down to two steps: creating the index, where IndexWriter indexes the different files and saves the result in the index storage location; and looking up the documents related to a keyword through the index.

The following example from the official site is analyzed below:

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead:
    //Directory directory = FSDirectory.open("/tmp/testindex");
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);
    Document doc = new Document();
    String text = "This is the text to be indexed.";
    doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
    iwriter.addDocument(doc);
    iwriter.close();

    // Now search the index:
    DirectoryReader ireader = DirectoryReader.open(directory);
    IndexSearcher isearcher = new IndexSearcher(ireader);
    // Parse a simple query that searches for "text":
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
    Query query = parser.parse("text");
    ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
    assertEquals(1, hits.length);
    // Iterate through the results:
    for (int i = 0; i < hits.length; i++) {
        Document hitDoc = isearcher.doc(hits[i].doc);
        assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
    }
    ireader.close();
    directory.close();

Creating the index

Step 1: define an analyzer. Take the sentence “我爱我们的中国” ("I love our China"): how should it be split, dropping the stop word “的” and extracting keywords such as “我”, “我们” and “中国”? That is the job of the Analyzer. The example uses the standard analyzer; for Chinese specifically, it can be combined with paoding.

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

The parameter Version.LUCENE_CURRENT means "use the current Lucene version"; in the environment used for this article it can also be written as Version.LUCENE_40.

Step 2: decide where the index files are stored. Lucene offers two options; choose whichever fits your needs.

1) Storage in the local file system (in Lucene 4.x, FSDirectory.open takes a java.io.File):

    Directory directory = FSDirectory.open(new File("/tmp/testindex"));

2) Storage in memory:

    Directory directory = new RAMDirectory();

Step 3: create the IndexWriter that writes the index files.

    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);

According to the official documentation, IndexWriterConfig holds the configuration for the IndexWriter. It takes two parameters: the version and the Analyzer.

Step 4: extract the content and store it in the index.

    Document doc = new Document();
    String text = "This is the text to be indexed.";
    doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
    iwriter.addDocument(doc);
    iwriter.close();

The first line creates a Document object, which is comparable to one row of a database table. The second line is the string about to be indexed. The third line adds the string under the field name fieldname (comparable to a column name); because TextField.TYPE_STORED is set, the original string is stored as well. If you do not want it stored, use a different field type; see the official documentation for details. The fourth line adds the doc object to the index being built. The fifth line closes the IndexWriter, which commits what was written. That completes index creation.

Keyword query

Step 1: open the storage location.

    DirectoryReader ireader = DirectoryReader.open(directory);

Step 2: create the searcher.

    IndexSearcher isearcher = new IndexSearcher(ireader);

Step 3: run a keyword query, much like a SQL query.

    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
    Query query = parser.parse("text");
    ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
    assertEquals(1, hits.length);
    for (int i = 0; i < hits.length; i++) {
        Document hitDoc = isearcher.doc(hits[i].doc);
        assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
    }

Here we create a query parser and give it the analyzer and the field to search, "fieldname". The query returns a collection of hits, similar to a SQL ResultSet, from which the stored content can be retrieved. For the different kinds of queries, see the official manual or the recommended slides.

Step 4: close the reader and the directory.

    ireader.close();
    directory.close();

A small example of my own

This example builds an index over the contents of a folder, filters the files by keyword, and reads back their contents. Note that it uses the newer Lucene API (no Version argument, and FSDirectory.open takes a Path), unlike the 4.0-era official example above.

    package cn.lnu.edu.yxk;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;

    import jxl.Cell;
    import jxl.Sheet;
    import jxl.Workbook;
    import jxl.read.biff.BiffException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.usermodel.Range;

    /**
     * Builds an index over the contents of a folder, then filters the files
     * by keyword and reads back their contents.
     *
     * @author yxk
     */
    public class IndexManager {
        private static String content = ""; // contents of the current file
        private static String INDEX_DIR = "D:\\test\\luceneIndex"; // where the index is stored
        private static String DATA_DIR = "D:\\test\\luceneData"; // folder with the files to index
        private static Analyzer analyzer = null; // analyzer
        private static Directory directory = null; // index storage location
        private static IndexWriter indexWriter = null; // writes the index files

        /**
         * Creates the index for the files in the given directory.
         *
         * @param path directory containing the files
         * @return whether the index was created
         */
        public static Boolean createIndex(String path) {
            Date date1 = new Date(); // start time, used to report the elapsed time
            List<File> files = listFile(path); // all matching files in the directory
            for (File file : files) {
                content = "";
                // Choose the extraction routine by file extension.
                String type = file.getName().substring(file.getName().lastIndexOf(".") + 1);
                if ("txt".equalsIgnoreCase(type)) {
                    content = txt2String(file);
                } else if ("doc".equalsIgnoreCase(type)) {
                    content = doc2String(file);
                } else if ("xls".equalsIgnoreCase(type)) {
                    content = xls2String(file);
                }
                System.out.println("name: " + file.getName());
                System.out.println("path: " + file.getPath());
                System.out.println();
                try {
                    analyzer = new StandardAnalyzer();
                    directory = FSDirectory.open(new File(INDEX_DIR).toPath());
                    // Create the index directory if it does not exist yet.
                    File indexFile = new File(INDEX_DIR);
                    if (!indexFile.exists()) {
                        indexFile.mkdirs();
                    }
                    // A new writer is opened and closed for every file.
                    IndexWriterConfig config = new IndexWriterConfig(analyzer);
                    indexWriter = new IndexWriter(directory, config);
                    // A Document is comparable to one row of a database table.
                    Document document = new Document();
                    // Field.Store.YES keeps the original value so it can be read back from a hit.
                    document.add(new org.apache.lucene.document.TextField("filename", file.getName(), Field.Store.YES)); // file name
                    document.add(new org.apache.lucene.document.TextField("content", content, Field.Store.YES)); // file contents
                    document.add(new org.apache.lucene.document.TextField("path", file.getPath(), Field.Store.YES)); // file path
                    indexWriter.addDocument(document);
                    // Commit what was written, then close the IndexWriter.
                    indexWriter.commit();
                    closeWriter();
                } catch (IOException e) {
                    e.printStackTrace();
                }
                content = "";
            }
            Date date2 = new Date();
            System.out.println("Index creation took " + (date2.getTime() - date1.getTime()) + "ms\n");
            return true;
        }

        /**
         * Searches the index and prints the matching files.
         *
         * @param text the query string
         */
        public static void serarchIndex(String text) {
            Date date1 = new Date();
            try {
                // Open the storage location.
                directory = FSDirectory.open(new File(INDEX_DIR).toPath());
                analyzer = new StandardAnalyzer();
                DirectoryReader ireader = DirectoryReader.open(directory);
                // Create the searcher.
                IndexSearcher isearcher = new IndexSearcher(ireader);
                // Parse the query against the "content" field, much like a SQL query.
                QueryParser parser = new QueryParser("content", analyzer);
                Query query = parser.parse(text);
                // The hits are comparable to a SQL ResultSet; stored values can be read from them.
                ScoreDoc[] hits = isearcher.search(query, 1000).scoreDocs;
                for (int i = 0; i < hits.length; i++) {
                    Document hitDoc = isearcher.doc(hits[i].doc);
                    System.out.println("-----------");
                    System.out.println(hitDoc.get("filename"));
                    System.out.println(hitDoc.get("content"));
                    System.out.println(hitDoc.get("path"));
                    System.out.println("------------");
                }
                // Close the reader and the directory.
                ireader.close();
                directory.close();
            } catch (IOException e) {
                e.printStackTrace();
            } catch (ParseException e) {
                e.printStackTrace();
            }
            Date date2 = new Date();
            System.out.println("Keyword query took " + (date2.getTime() - date1.getTime()) + "ms\n");
        }

        /**
         * Closes the IndexWriter if one is open.
         *
         * @throws IOException
         */
        private static void closeWriter() throws IOException {
            if (indexWriter != null)
                indexWriter.close();
        }

        /**
         * Reads the contents of an xls file (requires jxl.jar).
         *
         * @param file the file to read
         * @return the contents
         */
        private static String xls2String(File file) {
            String result = "";
            try {
                FileInputStream fis = new FileInputStream(file);
                StringBuilder sb = new StringBuilder();
                jxl.Workbook rwb = Workbook.getWorkbook(fis);
                Sheet[] sheet = rwb.getSheets();
                for (int i = 0; i < sheet.length; i++) {
                    Sheet rs = rwb.getSheet(i);
                    // Fixed: the row loop must test j, not i.
                    for (int j = 0; j < rs.getRows(); j++) {
                        Cell[] cells = rs.getRow(j);
                        for (int k = 0; k < cells.length; k++) {
                            sb.append(cells[k].getContents());
                        }
                    }
                }
                fis.close();
                result += sb.toString();
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (BiffException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return result;
        }

        /**
         * Reads the contents of a doc file (requires poi.jar).
         *
         * @param file the file to read
         * @return the contents
         */
        private static String doc2String(File file) {
            String result = "";
            try {
                FileInputStream fis = new FileInputStream(file);
                HWPFDocument document = new HWPFDocument(fis);
                Range range = document.getRange();
                result += range.text();
                fis.close();
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return result;
        }

        /**
         * Reads the contents of a txt file.
         *
         * @param file the file to read
         * @return the contents
         */
        private static String txt2String(File file) {
            String result = "";
            try {
                BufferedReader reader = new BufferedReader(new FileReader(file));
                String s = null;
                while ((s = reader.readLine()) != null) {
                    result = result + "\n" + s;
                }
                reader.close();
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return result;
        }

        /**
         * Filters the files in the given directory.
         *
         * @param path the directory
         * @return the files that match the supported formats
         */
        private static List<File> listFile(String path) {
            File[] files = new File(path).listFiles();
            List<File> fileList = new ArrayList<File>();
            for (File file : files) {
                if (isTxtFile(file.getName())) {
                    fileList.add(file);
                }
            }
            return fileList;
        }

        /**
         * Checks whether a file is one of the supported formats (.txt, .doc, .xls).
         *
         * @param name the file name
         * @return true if the name matches a supported format, false otherwise
         */
        private static boolean isTxtFile(String name) {
            if (name.lastIndexOf(".txt") > 0)
                return true;
            else if (name.lastIndexOf(".doc") > 0)
                return true;
            else if (name.lastIndexOf(".xls") > 0)
                return true;
            return false;
        }

        public static void main(String[] args) {
            // Recreate the index directory on every run.
            File fileIndex = new File(INDEX_DIR);
            deleteIndex(fileIndex);
            fileIndex.mkdir();
            // Build the index.
            createIndex(DATA_DIR);
            // Search by keyword.
            serarchIndex("中华");
        }

        /**
         * Deletes a directory and everything in it.
         *
         * @param fileIndex the index directory (or a file inside it)
         * @return always true
         */
        private static boolean deleteIndex(File fileIndex) {
            if (fileIndex.isDirectory()) {
                File[] files = fileIndex.listFiles();
                for (int i = 0; i < files.length; i++) {
                    deleteIndex(files[i]);
                }
            }
            fileIndex.delete();
            return true;
        }
    }

This article summarizes my reading of several blog posts. The original posts are at http://blog.csdn.net/csh624366188/article/category/895342 and https://www.cnblogs.com/xing901022/p/3933675.html

---------------------
Author: jofjhh
Source: CSDN
Original: https://blog.csdn.net/m0_37913549/article/details/78989078
Copyright notice: this is an original article by the author; please include a link to the post when reposting.
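One detail of the IndexManager example worth a closer look is how files are selected. isTxtFile accepts any name that contains ".txt", ".doc" or ".xls" after its first character, and the check is case-sensitive, while createIndex compares only the text after the last dot, ignoring case. A name such as notes.txt.bak or table.xlsx therefore passes the filter but is indexed with empty content, and report.TXT is skipped altogether. The following is a sketch of a stricter filter, not part of the original post; the class FileTypeFilterSketch and its methods are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: decides from the file extension whether a file should be indexed. */
final class FileTypeFilterSketch {
    // The three formats handled by the example's txt2String, doc2String and xls2String.
    private static final Set<String> SUPPORTED = new HashSet<>(Arrays.asList("txt", "doc", "xls"));

    /** Returns the lower-cased text after the last dot, or "" when there is no usable extension. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        // dot <= 0 covers names without a dot and dot-files such as ".txt";
        // a trailing dot has no extension either.
        if (dot <= 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** True only when the final extension is exactly one of the supported formats. */
    static boolean isSupported(String fileName) {
        return SUPPORTED.contains(extensionOf(fileName));
    }

    public static void main(String[] args) {
        System.out.println(isSupported("report.TXT"));    // true
        System.out.println(isSupported("notes.txt.bak")); // false
        System.out.println(isSupported("table.xlsx"));    // false
    }
}
```

Using the same extension logic for both filtering and choosing the extraction routine would keep the two decisions from drifting apart.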
