
Implementing Synonym Search with the WordNet Synonym Dictionary (C# Version)

Synonym search comes in handy quite often. A simple example: when we search for the keyword good, entries containing well or fine may also be results we want.

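We will not build our own synonym database here; we use WordNet's synonym database directly. This article walks through the C# implementation, and a follow-up will cover the Java version.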

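I. Get and build the C# version of Lucene

Because Lucene originated in Java, C# users are less fortunate than Java users. The Java version already has 3.0.2 available for download, while for C# the latest 2.9.2 source has to be checked out from the SVN repository at https:///repos/asf/lucene//tags/_2_9_2/, and binary packages exist only for 2.0.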

The next step is to build it with Visual Studio, which needs no further explanation.

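Just note that the contrib directory contains the solution we are after. Building it produces three executables:

1. Syns2Index.exe builds a synonym index from the WordNet synonym database. The synonyms themselves are later looked up through Lucene.
2. SynLookup.exe looks up the synonyms of a given word in the synonym index.
3. SynExpand.exe does much the same as SynLookup, but adds a weight to each synonym, roughly a measure of how close the synonym is.

With the .dll and the three files above in hand, we can move on to the next steps.

II. Download the WordNet synonym database

The file WNprolog-3.0.tar.gz can be downloaded from /3.0/.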

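Extract it to a directory such as D:\WNprolog-3.0. The prolog subdirectory contains many .pl files; the one used below is wn_s.pl.

III. Generate the Lucene synonym index

Run the command:

    Syns2Index.exe d:\WNprolog-3.0\prolog\wn_s.pl syn_index

The second argument is the directory in which the index is generated. The tool creates the directory for you, and the run takes about 40 seconds.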
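To give a feel for what the indexing step reads, here is a minimal sketch of grouping words by synset. It assumes that each relevant line of wn_s.pl is a Prolog fact of the shape s(synset_id, w_num, 'word', ss_type, sense_number, tag_count), for example s(100001740,1,'entity',n,1,11). The class name WnPrologSketch and its method are hypothetical; this is not the code Syns2Index actually uses.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only; not part of Lucene or Syns2Index.
public static class WnPrologSketch
{
    // Assumed line shape: s(100001740,1,'entity',n,1,11).
    // Group 1 is the synset id, group 2 is the word ('' stands for one quote).
    private static readonly Regex SenseLine =
        new Regex(@"^s\((\d+),\d+,'(.*)',[a-z],\d+,\d+\)\.$");

    // Groups words by synset id; words sharing an id are treated as synonyms.
    public static Dictionary<string, List<string>> GroupBySynset(IEnumerable<string> lines)
    {
        var groups = new Dictionary<string, List<string>>();
        foreach (string line in lines)
        {
            Match m = SenseLine.Match(line.Trim());
            if (!m.Success) continue; // skip anything that is not a sense fact

            string id = m.Groups[1].Value;
            string word = m.Groups[2].Value.Replace("''", "'").ToLowerInvariant();

            List<string> words;
            if (!groups.TryGetValue(id, out words))
            {
                words = new List<string>();
                groups[id] = words;
            }
            if (!words.Contains(word)) words.Add(word);
        }
        return groups;
    }
}
```

Calling WnPrologSketch.GroupBySynset(System.IO.File.ReadLines(path)) on the extracted file would yield one word list per synset id. An index builder could then store every word together with the other members of its groups.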

That is what happens when all goes well. The command may also fail outright, with Syns2Index.exe reporting the following error:

    Unhandled Exception: System.ArgumentException: maxBufferedDocs must at least be 2 when enabled
       at .Index.IndexWriter.SetMaxBufferedDocs(Int32 maxBufferedDocs)
       at .Syns2Index.Index(String indexDir, IDictionary word2Nums, IDictionary num2Words)
       at .Syns2Index.Main(String[] args)

No need to worry: with the source code at hand this is easy to fix. Open the Syns2Index project and, in Syns2Index.cs, change

    writer.SetMaxBufferedDocs(writer.GetMaxBufferedDocs() * 2); // GetMaxBufferedDocs() is itself 0, so multiplying it achieves nothing

to

    writer.SetMaxBufferedDocs(100); // so set it directly to 100, or to any value of at least 2

Then rebuild and rerun the previous command with the new Syns2Index.exe.

After a successful run, a new index directory named syn_index appears, about 3 MB in size.

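The other two commands can now be used to test the index:

    D:\wordnet>SynLookup.exe syn_index hi
    Synonyms found for "hi":
    hawaii
    hello
    howdy
    hullo

    D:\wordnet>SynExpand.exe syn_index hi
    Query: hi hawaii^0.9 hello^0.9 howdy^0.9 hullo^0.9

The index can also be inspected with Luke - Lucene Index ToolBox. It has two fields, syn and word; searching for word:hi finds syn:hawaii hello howdy hullo.

IV. Search with a synonym analyzer and filter

By comparison, Java programmers have a much easier time: there is a ready-made lucene-wordnet-3.0.2.jar that already contains code they can use.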
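The query string printed by SynExpand follows a simple pattern: the original word, then each synonym with a ^boost suffix. The sketch below reproduces that pattern as plain string building. The class QueryExpansionSketch and its Expand method are hypothetical names, and the sketch is not how SynExpand itself is implemented; it only shows how a list of synonyms maps to a boosted query string.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper, for illustration only.
public static class QueryExpansionSketch
{
    // Builds "word syn1^boost syn2^boost ..." from a word and its synonyms.
    public static string Expand(string word, IEnumerable<string> synonyms, float boost)
    {
        var sb = new StringBuilder(word);
        foreach (string syn in synonyms)
        {
            if (syn == word) continue; // never boost the word against itself

            sb.Append(' ').Append(syn).Append('^')
              // Invariant culture keeps the decimal point a '.' on every machine.
              .Append(boost.ToString("0.0##", CultureInfo.InvariantCulture));
        }
        return sb.ToString();
    }
}
```

With the synonyms reported for hi and a boost of 0.9f, Expand returns hi hawaii^0.9 hello^0.9 howdy^0.9 hullo^0.9, the same text as in the SynExpand output. A boost below 1 ranks synonym matches slightly lower than matches on the word the user actually typed.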

In C#, the analyzers and filters have to be written by hand. Perhaps I have wandered down a side road, but at least it is not a rough one.
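The central idea of a synonym filter can be sketched without any Lucene types: each original term advances the position by one, and each synonym is emitted at the same position as its source term, with an increment of zero. The class SynonymInjectionSketch, its Inject method, and the lookup delegate are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, Lucene-free illustration of synonym injection.
public static class SynonymInjectionSketch
{
    // Returns (term, positionIncrement) pairs.
    // Original terms get increment 1; injected synonyms get increment 0.
    public static List<KeyValuePair<string, int>> Inject(
        IEnumerable<string> tokens,
        Func<string, IEnumerable<string>> lookup)
    {
        var output = new List<KeyValuePair<string, int>>();
        foreach (string token in tokens)
        {
            output.Add(new KeyValuePair<string, int>(token, 1));

            IEnumerable<string> synonyms = lookup(token);
            if (synonyms == null) continue; // no synonyms known for this term

            foreach (string syn in synonyms)
            {
                // Skip the term itself so it is not emitted twice.
                if (syn != token)
                    output.Add(new KeyValuePair<string, int>(syn, 0));
            }
        }
        return output;
    }
}
```

Because the synonyms share a position with the original term, a phrase or proximity query that matches the original wording can also match when a synonym stands in its place.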

I will not describe every small step; here is the code, which should speak for itself.

The synonym engine interface:

    using System.Collections.Generic;

    namespace Com.Unmi.Searching
    {
        /// <summary>
        /// Summary description for ISynonymEngine
        /// </summary>
        public interface ISynonymEngine
        {
            IEnumerable<string> GetSynonyms(string word);
        }
    }

The synonym engine implementation:

    using System.IO;
    using System.Collections.Generic;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;
    using Lucene.Net.Store;

    using LuceneDirectory = Lucene.Net.Store.Directory;
    using Version = Lucene.Net.Util.Version;

    namespace Com.Unmi.Searching
    {
        /// <summary>
        /// Summary description for WordNetSynonymEngine
        /// </summary>
        public class WordNetSynonymEngine : ISynonymEngine
        {
            private IndexSearcher searcher;
            private Analyzer analyzer = new StandardAnalyzer();

            // syn_index_directory is the synonym index directory generated earlier with Syns2Index
            public WordNetSynonymEngine(string syn_index_directory)
            {
                LuceneDirectory indexDir = FSDirectory.Open(new DirectoryInfo(syn_index_directory));
                searcher = new IndexSearcher(indexDir, true);
            }

            public IEnumerable<string> GetSynonyms(string word)
            {
                QueryParser parser = new QueryParser(Version.LUCENE_29, "word", analyzer);
                Query query = parser.Parse(word);
                Hits hits = searcher.Search(query);

                // collects every synonym stored in the "syn" field of the matching documents
                List<string> Synonyms = new List<string>();

                for (int i = 0; i < hits.Length(); i++)
                {
                    Field[] fields = hits.Doc(i).GetFields("syn");
                    foreach (Field field in fields)
                    {
                        Synonyms.Add(field.StringValue());
                    }
                }

                return Synonyms;
            }
        }
    }

(The original post continues here as part 2, dated 2010-07-18 10:49.)

The filter, which the analyzer below relies on:

    using System;
    using System.Collections.Generic;
    using Lucene.Net.Analysis;

    namespace Com.Unmi.Searching
    {
        /// <summary>
        /// Summary description for SynonymFilter
        /// </summary>
        public class SynonymFilter : TokenFilter
        {
            private Queue<Token> synonymTokenQueue = new Queue<Token>();

            public ISynonymEngine SynonymEngine { get; private set; }

            public SynonymFilter(TokenStream input, ISynonymEngine synonymEngine)
                : base(input)
            {
                if (synonymEngine == null)
                    throw new ArgumentNullException("synonymEngine");

                SynonymEngine = synonymEngine;
            }

            public override Token Next()
            {
                // if our synonymTokens queue contains any tokens, return the next one.
                if (synonymTokenQueue.Count > 0)
                {
                    return synonymTokenQueue.Dequeue();
                }

                // get the next token from the input stream
                Token token = input.Next();

                // if the token is null, then it is the end of stream, so return null
                if (token == null)
                    return null;

                // retrieve the synonyms
                IEnumerable<string> synonyms = SynonymEngine.GetSynonyms(token.TermText());

                // if we don't have any synonyms just return the token
                if (synonyms == null)
                {
                    return token;
                }

                // if we do have synonyms, add them to the synonymQueue,
                // and then return the original token
                foreach (string syn in synonyms)
                {
                    // make sure we don't add the same word
                    if (!token.TermText().Equals(syn))
                    {
                        // create the synonymToken
                        Token synToken = new Token(syn, token.StartOffset(),
                            token.EndOffset(), "<SYNONYM>");

                        // set the position increment to zero
                        // this tells lucene the synonym is
                        // in the exact same location as the originating word
                        synToken.SetPositionIncrement(0);

                        // add the synToken to the synonyms queue
                        synonymTokenQueue.Enqueue(synToken);
                    }
                }

                // after adding the syn to the queue, return the original token
                return token;
            }
        }
    }

The analyzer chains several filters; the most important one, of course, is the synonym filter defined above:

    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;

    namespace Com.Unmi.Searching
    {
        public class SynonymAnalyzer : Analyzer
        {
            public ISynonymEngine SynonymEngine { get; private set; }

            public SynonymAnalyzer(ISynonymEngine engine)
            {
                SynonymEngine = engine;
            }

            public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
            {
                // create the tokenizer
                TokenStream result = new StandardTokenizer(reader);

                // add in filters
                // first normalize the StandardTokenizer
                result = new StandardFilter(result);

                // makes sure everything is lower case
                result = new LowerCaseFilter(result);

                // use the default list of Stop Words, provided by the StopAnalyzer class.
                result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS);

                // injects the synonyms.
                result = new SynonymFilter(result, SynonymEngine);

                // return the built token stream.
                return result;
            }
        }
    }

Finally, the synonym engine, filter, and analyzer above are put to use:

    using System;
    using System.IO;
    using System.Web;
    using System.Collections;
    using System.Collections.Generic;
    using Lucene.Net.Index;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Search;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Store;
    using Lucene.Net.Highlight;

    using Version = Lucene.Net.Util.Version;
    using LuceneDirectory = Lucene.Net.Store.Directory;

    namespace Com.Unmi.Searching
    {
        public class Searcher
        {
            /// <summary>
            /// Assumes the synonym index created earlier is in d:\indexes\syn_index,
            /// and the content index to search is in d:\indexes\file_index, with two fields, file and content.
            /// IndexEntry is a search result class you create yourself, with two properties, file and fragment.
            /// </summary>
            /// <param name="queryString">queryString</param>
            public static List<IndexEntry> Search(string queryString)
            {
                // Now SynonymAnalyzer
                ISynonymEngine synonymEngine = new WordNetSynonymEngine(@"d:\indexes\syn_index");
                Analyzer analyzer = new SynonymAnalyzer(synonymEngine);

                LuceneDirectory indexDir = FSDirectory.Open(new DirectoryInfo(@"d:\indexes\file_index"));
                IndexSearcher searcher = new IndexSearcher(indexDir, true);

                QueryParser parser = new QueryParser(Version.LUCENE_29, "content", analyzer);

                Query query = parser.Parse(queryString);

                Hits hits = searcher.Search(query);

                // the return type is a list of IndexEntry, which has two properties, file and fragment
                List<IndexEntry> entries = new List<IndexEntry>();

                // this also uses another Lucene helper component from Contrib, which highlights the search keywords
                SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<span style='background-color:#23dc23;color:white'>", "</span>");
                Highlighter highlighter = new Highlighter(simpleHTMLFormatter, new QueryScorer(query));

                highlighter.SetTextFragmenter(new SimpleFragmenter(256));
                highlighter.SetMaxDocBytesToAnalyze(int.MaxValue);

                Analyzer standAnalyzer = new StandardAnalyzer();

                for (int i = 0; i < hits.Length(); i++)
                {
                    Document doc = hits.Doc(i);

                    // Never use the SynonymAnalyzer here.
                    // Note: passing the SynonymAnalyzer instance created above would send the
                    // highlighter into a dreadful series of loops.
                    string fragment = highlighter.GetBestFragment(standAnalyzer/*analyzer*/, "content", doc.Get("content"));

                    IndexEntry entry = new IndexEntry(doc.Get("file"), fragment);
                    entries.Add(entry);
                }

                return entries;
            }
        }
    }

V. See synonym search in action

After such a long stretch of code, I am not sure how many readers have made it this far, so it is time for a concrete look at the result. A search for ok also retrieves entries containing fine, because fine is a synonym of ok; results matching any other synonyms would be shown as well.
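The Searcher class in section IV relies on an IndexEntry type that the article leaves to the reader, describing it only as a result class with two properties, file and fragment. A minimal sketch of such a class might look like the following. The property names File and Fragment and the two-argument constructor are assumptions, chosen to match the call new IndexEntry(doc.Get("file"), fragment) in the Searcher code.

```csharp
namespace Com.Unmi.Searching
{
    // Assumed shape of the result class; the original article does not show it.
    public class IndexEntry
    {
        // Value of the "file" field of the matching document.
        public string File { get; private set; }

        // Best-matching fragment of the "content" field, with highlight markup.
        public string Fragment { get; private set; }

        public IndexEntry(string file, string fragment)
        {
            File = file;
            Fragment = fragment;
        }
    }
}
```

Keeping the class immutable after construction means that a list of results can be handed to display code without the risk of entries being modified along the way.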
